Abstract. We propose a new subgradient method for the minimization of nonsmooth convex functions over a convex set. To speed up computations we use adaptive approximate projections, which are only required to move within a certain distance of the exact projections (a distance that decreases in the course of the algorithm). In particular, the iterates in our method can be infeasible throughout the whole procedure. Nevertheless, we provide conditions which ensure convergence to an optimal feasible point under suitable assumptions. One convergence result deals with step size sequences that are fixed a priori. Two other results handle dynamic Polyak-type step sizes depending on a lower or upper estimate of the optimal objective function value, respectively. Additionally, we briefly sketch two applications: optimization with convex chance constraints, and finding the minimum $\ell_1$-norm solution to an underdetermined linear system, an important problem in Compressed Sensing.

March 12, 2013 

1 Introduction 

The projected subgradient method [49] is a classical algorithm for the minimization of a nonsmooth convex function $f$ over a convex closed constraint set $X$, i.e., for the problem

$$\min f(x) \quad \text{s.t.} \quad x \in X. \tag{1}$$

One iteration consists of taking a step of size $\alpha_k$ along the negative direction of an arbitrary subgradient $h^k$ of the objective function $f$ at the current point $x^k$ and then computing the next iterate by projection ($P_X$) onto the feasible set $X$:

$$x^{k+1} := P_X(x^k - \alpha_k h^k).$$

Over the past decades, numerous extensions and specializations of this scheme have been developed and proven to converge to a minimum (or minimizer). Well-known disadvantages of the subgradient method are its slow local convergence and the necessity to extensively tune algorithmic parameters in order to obtain practical convergence. On the positive side, subgradient methods involve fast iterations and are easy to implement. In fact, they have been widely used in applications and (still) form one of the most popular algorithms for nonsmooth convex minimization.

The main effort in each iteration of the projected subgradient algorithm usually lies in the computation of the projection $P_X$. Since the projection is the solution of a (smooth) convex program itself, the required time depends on the structure of $X$ and corresponding specialized algorithms. Examples admitting a fast projection include the case where $X$ is the nonnegative orthant or the $\ell_1$-norm ball $\{x \mid \|x\|_1 \le \tau\}$, onto which any $x \in \mathbb{R}^n$ can be projected in $O(n)$ time, see [50]. The projection is more involved if $X$ is, for instance, an affine space or a (convex) polyhedron. In these latter cases, it makes sense to replace the exact projection $P_X$ by an approximation $P_X^{\varepsilon}$. That is, we do not approximate the projection operator uniformly, but, for a given $x$, we approximate the projected point adaptively up to a desired accuracy. This is formalized by computing points $P_X^{\varepsilon}(x)$ with the property that $\|P_X^{\varepsilon}(x) - P_X(x)\| \le \varepsilon$ for every $\varepsilon \ge 0$. This concept of an absolute accuracy of the projected point is similar in spirit to the adaptive evaluation of operators as, e.g., used in adaptive wavelet methods (cf. the APPLY-routine in [13]). Algorithmically, the idea is that during the early phases of the algorithm we do not need a highly accurate projection, and $P_X^{\varepsilon}(x)$ can be faster to compute if $\varepsilon$ is larger. In the later phases, one then adaptively tightens the requirement on the accuracy.
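As an illustration of a cheap exact projection, the following is a minimal sketch of the Euclidean projection onto the $\ell_1$-norm ball. It is not the $O(n)$ method of [50]; it uses the simpler sort-based thresholding scheme, which costs $O(n \log n)$. The function name `project_l1_ball` and the argument name `tau` (the ball radius) are assumptions made for this example only.

```python
def project_l1_ball(v, tau):
    """Euclidean projection of v onto {x : ||x||_1 <= tau}, assuming tau > 0.

    Sort-based variant (O(n log n)); illustrative, not the O(n) method of [50].
    """
    if sum(abs(t) for t in v) <= tau:
        return list(v)  # v is already feasible, nothing to do

    # Find the soft-threshold theta from the sorted absolute values.
    u = sorted((abs(t) for t in v), reverse=True)
    running_sum = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        running_sum += u_j
        candidate = (running_sum - tau) / j
        if u_j > candidate:  # the j largest entries stay nonzero
            theta = candidate

    # Shrink every entry towards zero by theta (soft thresholding).
    return [(1.0 if t > 0 else -1.0) * max(abs(t) - theta, 0.0) for t in v]
```

For example, projecting `[3.0, 1.0]` onto the ball of radius 2 shrinks both entries by the threshold 1, which gives `[2.0, 0.0]`.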

One particularly attractive situation in which the approach works is the case where $X$ is an affine space, i.e., defined by a linear equation system. Then one can use a truncated iterative method, e.g., a conjugate gradient (CG) approach, to obtain an adaptive approximate projection. We have observed that often only a few steps (2 or 3) of the CG procedure are needed to obtain a practically convergent method.
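The truncated CG idea can be sketched as follows, under assumptions that are not taken from the text: $X = \{x \mid Ax = b\}$ with $A$ of full row rank, and a known positive lower bound `sigma_min` on the smallest singular value of $A$. The exact projection is $y - A^\top z^*$ with $(AA^\top) z^* = Ay - b$. If CG is stopped at $z$ with residual $r$, then $\|A^\top (z - z^*)\|^2 = r^\top (AA^\top)^{-1} r \le \|r\|^2 / \sigma_{\min}^2$, so stopping once $\|r\| \le \varepsilon\,\sigma_{\min}$ yields a point within distance $\varepsilon$ of the exact projection. All function names are hypothetical, and this is not claimed to be the construction discussed later in the paper.

```python
import math


def matvec(A, x):
    """Return A x for a matrix given as a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def rmatvec(A, z):
    """Return A^T z."""
    return [sum(A[i][j] * z[i] for i in range(len(A))) for j in range(len(A[0]))]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


def approx_project_affine(A, b, y, eps, sigma_min):
    """Approximate projection of y onto {x : A x = b} via truncated CG.

    Assumes eps > 0, A has full row rank, and sigma_min is a positive lower
    bound on the smallest singular value of A (illustrative assumptions).
    Returns the approximate projection and the number of CG steps used.
    """
    c = [s - t for s, t in zip(matvec(A, y), b)]  # right-hand side A y - b
    z = [0.0] * len(b)
    r = c[:]  # residual c - (A A^T) z for the start value z = 0
    p = r[:]
    rr = sum(t * t for t in r)
    steps = 0
    max_steps = 10 * len(b)  # safeguard against rounding effects
    while math.sqrt(rr) > eps * sigma_min and steps < max_steps:
        Mp = matvec(A, rmatvec(A, p))  # (A A^T) p
        step = rr / sum(s * t for s, t in zip(p, Mp))
        z = [zi + step * pi for zi, pi in zip(z, p)]
        r = [ri - step * mi for ri, mi in zip(r, Mp)]
        rr_new = sum(t * t for t in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        steps += 1
    x = [yi - ti for yi, ti in zip(y, rmatvec(A, z))]
    return x, steps
```

With a loose accuracy the loop may not run at all (the input itself is returned), while a tight accuracy forces more CG steps; this is the trade-off the adaptive scheme exploits.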

In this paper, we focus on the investigation of convergence properties of a general variant of the projected subgradient method which relies on such adaptive approximate projections. We study conditions on the step sizes and on the accuracy requirements $\varepsilon_k$ (in each iteration $k$) in order to achieve convergence of the sequence of iterates to an optimal point, or at least convergence of the function values to the optimum. We investigate two variants of the algorithm. In the first one, the sequence $(\alpha_k)$ of step sizes forms a divergent but square-summable series ($\sum \alpha_k = \infty$, $\sum \alpha_k^2 < \infty$) and is given a priori. The second variant uses dynamic step sizes which depend on the difference of the current function value to a constant target value that estimates the optimal value.

A crucial difference of the resulting algorithms to the standard method is the fact that iterates can be infeasible, i.e., are not necessarily contained in $X$. We thus call the algorithm of this paper infeasible-point subgradient algorithm (ISA). As a consequence, the objective function values of the iterates might be smaller than the optimum, which requires a non-standard analysis; see the proofs in Section 3 for details. Moreover, we always assume that $X$ is strictly contained in the interior of the domain $\operatorname{dom} f$ of $f$. Note that this excludes the case $X = \operatorname{dom} f$, where our algorithm cannot be applied. Furthermore, we assume that every iterate lies in $\operatorname{dom} f$, since otherwise no first-order information is available. This is automatically fulfilled if $\operatorname{dom} f$ is the whole space, or it can be ensured by requiring that the accuracies $\varepsilon_k$ are small enough; cf. also Part 4 of Remark 3.

This paper is organized as follows. We first discuss related approaches in the literature. Then we fix some notation and recall a few basics. In the main part of this paper (Sections 2 and 3), we state our infeasible-point subgradient algorithm (ISA) and provide proofs of convergence. In the subsequent sections we briefly discuss some variants of ISA, an example for the adaptive approximate projection operator from the context of convex chance constraints, and an application of ISA to the problem of finding the minimum $\ell_1$-norm solution of an underdetermined linear equation system, a problem that has lately received a lot of attention in the context of compressed sensing (see, e.g., [17, 10, 15]). We finish with some concluding remarks and give pointers to possible extensions as well as topics of future research.

1.1 Related work 

The objective function values of the iterates in subgradient algorithms typically do not decrease monotonically. With the right choice of step sizes, the (projected) subgradient method nevertheless guarantees convergence of the objective function values to the minimum, see, e.g., [49, 44, 5, 46]. A typical result of this sort holds for step size sequences $(\alpha_k)$ which are nonsummable ($\sum_{k=0}^{\infty} \alpha_k = \infty$), but square-summable ($\sum_{k=0}^{\infty} \alpha_k^2 < \infty$). Thus, $\alpha_k \to 0$ as $k \to \infty$. Often, the corresponding sequence of points can also be guaranteed to converge to an optimal solution $x^*$, although this is not necessarily the case; see [3] for a discussion.

Another widely used step size rule uses an estimate $\varphi$ of the optimal value $f^*$, a subgradient $h^k$ of the objective function $f$ at the current iterate $x^k$, and relaxation parameters $\lambda_k > 0$:

$$\alpha_k := \lambda_k \, \frac{f(x^k) - \varphi}{\|h^k\|^2}. \tag{2}$$

The parameters $\lambda_k$ are constant or required to obey certain conditions needed for convergence proofs. The dynamic rule (2) is a straightforward generalization of the so-called Polyak-type step size rule, which uses $\varphi = f^*$, to the more practical case when $f^*$ is unknown. The convergence results given in [2] extend the work of Polyak [44, 45] to $\varphi \ge f^*$ and $\varphi < f^*$ by imposing certain conditions on the sequence $(\lambda_k)$. We will generalize these results further, using an adaptive approximate projection operator instead of the (exact) Euclidean projection.
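The dynamic rule (2) amounts to a one-line computation. The following minimal sketch evaluates it for a given function value, target estimate, subgradient, and relaxation parameter; the function name `polyak_step` and its argument names are chosen for this example and are not part of the original text.

```python
def polyak_step(f_k, phi, h_k, lam):
    """Dynamic step size of rule (2): lam * (f_k - phi) / ||h_k||^2.

    f_k : objective value at the current iterate
    phi : estimate of the optimal value f*
    h_k : subgradient at the current iterate (sequence of floats)
    lam : relaxation parameter lambda_k > 0
    """
    norm_sq = sum(t * t for t in h_k)
    if norm_sq == 0.0:
        raise ValueError("rule (2) is undefined for a zero subgradient")
    return lam * (f_k - phi) / norm_sq
```

Note that the step becomes negative when `f_k < phi`, which is one reason the dynamic variant of the algorithm needs special handling of this case.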

Many extensions of the basic subgradient scheme exist, such as variable target value methods (see, e.g., [14, 28, 36, 40, 48, 19, 5]), using approximate subgradients [6, 1, 34, 16], or incremental projection schemes [23, 40, 31], to name just a few.

Inexact projections have been used previously, probably most prominently for convex feasibility problems in the framework of successive projection methods. Indeed, the optimization problem (1) can, at least theoretically, be cast as the convex feasibility problem to determine $x^* \in X \cap \{x \mid f(x) \le f^*\}$. Using so-called subgradient projections [4] onto the second set leads to a subgradient step

$$x^{k+1} := x^k - \frac{f(x^k) - f^*}{\|h^k\|^2} \, h^k,$$

which corresponds to using a Polyak-type step size without relaxation parameter, employing the exact optimal value. As illustrated in [4], this approach leads to a very flexible framework for convex feasibility problems as well as (non-smooth) convex optimization problems.

Moreover, [52] considers additive vanishing non-summable error terms (for 
both the projection and the subgradient step) and establishes the existence of a 
(decaying) bound on the error terms such that the algorithm will reach a small 
neighborhood of the optimal set. However, these bounds are not given explicitly. 
In contrast, our results (Theorems 1 and 3) contain explicit conditions for the 
error terms that guarantee convergence to the optimum. Another example for the 
use of inexact projections is the level set subgradient algorithm in [30], although 
there, all iterates are strictly feasible. 

We emphasize that there are at least three conceptually different approaches to approximate projections in the present context. The first concept (prominent, e.g., in the field of convex feasibility problems) uses the idea of approximating the direction towards the feasible set, i.e., the iterates approximately move towards the constraint set. In the second, related, concept one projects exactly onto supersets of the constraint set which are easier to handle, e.g., half-spaces.
With both ideas one can use powerful notions like Fejer-monotonicity or the 
concept of firmly non-expansive mappings, see, e.g., [4] and the more recent [35]; 
see also the "feasibility operator" framework proposed in [23]. To employ either 
approach one exploits analytical knowledge about the feasible set, e.g., that it 
can be written as a level set of a known and easy-to-handle convex function. 
In the third approach, one aims at approximating the projected point without 
further restricting the direction. This concept applies, for instance, in situations 
in which a computational error is made in the projection step (e.g., as in [52]) or 
when it is impossible or undesirable to handle the constraints analytically, but 
a numerical algorithm is available which calculates the projection point up to a 
given accuracy. The adaptive approximate projections considered in this paper 
fall under this third category. 

Note that, besides the different philosophies and fields of application, none of the approaches directly dominates the other: On the one hand, one may move directly towards the feasible set while missing the projection point, and on the other hand, one may also move closer to the projected point along a direction which is not towards the feasible set; see Figure 1 for an illustration. However, one can sometimes, for a given rule which approximates the projection direction, find appropriate half-spaces which contain the feasible set and realize this approximate projection exactly. In Section 5 we give a concrete example in which the Fejer-type feasibility operator of [23] is not applicable, but the exact projection point can be approximated reasonably well in the sense of our adaptive approximate projection (see above or (7) below).

Fig. 1. Schematic illustration of the three concepts of "approximate projections": The approximation of the projection direction (or "moving towards the feasible set") moves from $x$ along a direction within the shaded cone. The exact projection onto a half-space containing $X$ moves along the dashed line. The approximation of the projected point moves from $x$ into a neighborhood of $P_X(x)$, the shaded circle.

In the present paper we only consider the third approach to approximate 
projections and do not use any assumption like non-expansiveness or Fejer- 
monotonicity for the iteration mapping in our convergence analyses. 



1.2 Notation 

In this paper, we consider the convex optimization problem (1) in which we assume that $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is a convex function (not necessarily differentiable), $\operatorname{dom} f = \{x \in \mathbb{R}^n \mid f(x) < \infty\}$, and $X \subset \operatorname{int}(\operatorname{dom} f) \subseteq \mathbb{R}^n$ is a closed convex set (note that this implies that $f$ is continuous on $X$). The set

$$\partial f(x) := \{ h \in \mathbb{R}^n \mid f(y) \ge f(x) + h^\top (y - x) \ \ \forall\, y \in \mathbb{R}^n \} \tag{3}$$

is the subdifferential of $f$ at a point $x \in \mathbb{R}^n$; its members are the corresponding subgradients. Throughout this paper, we will assume (1) to have a nonempty set of optima

$$X^* := \operatorname{argmin}\{ f(x) \mid x \in X \}. \tag{4}$$

An optimal point will be denoted by $x^*$ and its objective function value $f(x^*)$ by $f^*$. For a sequence $(x^k) = (x^0, x^1, x^2, \dots)$ of points, the corresponding sequence of objective function values will be abbreviated by $(f_k) = (f(x^k))$.

By $\|\cdot\|_p$ we denote the usual $\ell_p$-norm, i.e., for $x \in \mathbb{R}^n$,

$$\|x\|_p := \begin{cases} \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}, & \text{if } 1 \le p < \infty, \\ \max_{i=1,\dots,n} |x_i|, & \text{if } p = \infty. \end{cases} \tag{5}$$

If no confusion can arise, we shall simply write $\|\cdot\|$ instead of $\|\cdot\|_2$ for the Euclidean ($\ell_2$-)norm. The Euclidean distance of a point $x$ to a set $Y$ is

$$d_Y(x) := \inf_{y \in Y} \|x - y\|_2. \tag{6}$$

For $Y$ closed and convex, (6) has a unique minimizer, namely the orthogonal (Euclidean) projection of $x$ onto $Y$, denoted by $P_Y(x)$.

All further notation will be introduced where it is needed. 

2 The Infeasible-Point Subgradient Algorithm (ISA) 

In the projected subgradient algorithm, we replace the exact projection $P_X$ by an adaptive approximate projection. We require that we can adapt the accuracy of the approximation of the projected point absolutely, i.e., that for any given accuracy parameter $\varepsilon \ge 0$, the adaptive approximate projection $P_X^{\varepsilon} : \mathbb{R}^n \to \mathbb{R}^n$ satisfies

$$\|P_X^{\varepsilon}(x) - P_X(x)\| \le \varepsilon \quad \text{for all } x \in \mathbb{R}^n. \tag{7}$$

In particular, for $\varepsilon = 0$, we have $P_X^{0} = P_X$. Note that $P_X^{\varepsilon}(x)$ does not necessarily produce a point that is closer to $P_X(x)$ (or even to $X$) than $x$ itself. In fact, this is only guaranteed for $\varepsilon < d_X(x)$.

One example arises in the context of convex chance constraints and is discussed in Section 5.1. For the special case in which $X$ is an affine space, we give a detailed discussion of an adaptive approximate projection satisfying the above requirement in Section 5.2.

By replacing the exact by an adaptive projection in the projected subgradient method, we obtain the Infeasible-point Subgradient Algorithm (ISA), which we will discuss in two variants in the following.

The stopping criteria of the algorithms will be ignored for the convergence 
analyses. In practical implementations, one would stop, e.g., if no significant 
progress in the objective (or feasibility) has occurred within a certain number of 
iterations. 

2.1 ISA with a predetermined step size sequence 

If the step sizes $(\alpha_k)$ and projection accuracies $(\varepsilon_k)$ are predetermined (i.e., given a priori), we obtain Algorithm 1. Note that $h^k = 0$ might occur, but does not necessarily imply that $x^k$ is optimal, because $x^k$ may be infeasible. In such a case, the adaptive projection will change $x^k$ to a different point as soon as $\varepsilon_k$ becomes small enough.

Algorithm 1 Predetermined Step Size ISA

Input: a starting point $x^0$, sequences $(\alpha_k)$, $(\varepsilon_k)$
Output: an (approximate) solution to (1)

1: initialize $k := 0$
2: repeat
3:   choose a subgradient $h^k \in \partial f(x^k)$ of $f$ at $x^k$
4:   compute the next iterate $x^{k+1} := P_X^{\varepsilon_k}(x^k - \alpha_k h^k)$
5:   increment $k := k + 1$
6: until a stopping criterion is satisfied

We will now state our main convergence result for this variant of the ISA, using fairly standard step size conditions. The proof is provided in Section 3.

Theorem 1 (Convergence for predetermined step size sequences). Let the projection accuracy sequence $(\varepsilon_k)$ be such that

$$\varepsilon_k \ge 0, \qquad \sum_{k=0}^{\infty} \varepsilon_k < \infty, \tag{8}$$

let the positive step size sequence $(\alpha_k)$ be such that

$$\sum_{k=0}^{\infty} \alpha_k = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty, \tag{9}$$

and let the following relation hold:

$$\alpha_k \ge \sum_{j=k}^{\infty} \varepsilon_j \qquad \forall\, k = 0, 1, 2, \dots \tag{10}$$

Suppose $\|h^k\| \le H < \infty$ for all $k$. Then the sequence of the ISA iterates $(x^k)$ converges to an optimal point.
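To make the iteration concrete, here is a minimal sketch of the predetermined step size variant on a toy instance that is not from the paper: minimize $f(x) = \|x\|_1$ over the line $X = \{x \in \mathbb{R}^2 \mid x_1 + x_2 = 1\}$, so that $f^* = 1$. The inexact projection is simulated by moving the exact projection a distance of exactly $\varepsilon$ off the line, so every iterate is infeasible. The sequences $\alpha_k = 1/(k+1)$ and $\varepsilon_k = 1/(k+2)^2$ are index-shifted so that $k$ starts at 0; they satisfy (8)-(10) because $\sum_{j \ge k} 1/(j+2)^2 \le \int_{k+1}^{\infty} x^{-2}\,dx = 1/(k+1)$. All function names are hypothetical.

```python
import math


def isa_predetermined(x0, subgrad, approx_proj, alpha, eps, iters):
    """Run `iters` steps of x^{k+1} = P_X^{eps_k}(x^k - alpha_k h^k)."""
    x = list(x0)
    for k in range(iters):
        h = subgrad(x)
        y = [xi - alpha(k) * hi for xi, hi in zip(x, h)]
        x = approx_proj(y, eps(k))
    return x


def f(x):
    """Toy objective: the l1-norm."""
    return sum(abs(t) for t in x)


def subgrad(x):
    """A subgradient of the l1-norm: the componentwise sign vector."""
    return [(t > 0) - (t < 0) for t in x]


def approx_proj(y, eps):
    """Simulated inexact projection onto {x : x1 + x2 = 1}.

    Computes the exact projection and then shifts it by exactly eps along
    the normal direction, so the result is infeasible whenever eps > 0.
    """
    shift = (y[0] + y[1] - 1.0) / 2.0
    off = eps / math.sqrt(2.0)
    return [y[0] - shift + off, y[1] - shift + off]


x_final = isa_predetermined(
    [3.0, -2.0], subgrad, approx_proj,
    alpha=lambda k: 1.0 / (k + 1),
    eps=lambda k: 1.0 / (k + 2) ** 2,
    iters=200,
)
```

In this toy run the objective value approaches $f^* = 1$ from above while the feasibility violation $|x_1 + x_2 - 1|$ shrinks at the rate of $\varepsilon_k$, although no iterate is ever exactly feasible.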

Remark 1. Relations (8), (9), and (10) can be ensured, e.g., by the sequences $\varepsilon_k = 1/k^2$ and $\alpha_k = 1/(k-1)$ for $k > 1$; in particular,

$$\sum_{j=k}^{\infty} \varepsilon_j \le \int_{k-1}^{\infty} \frac{1}{x^2} \, dx = \frac{1}{k-1} = \alpha_k.$$

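The tail condition (10) for the sequences of Remark 1 can also be checked numerically. The sketch below uses the known value $\sum_{j \ge 1} 1/j^2 = \pi^2/6$ to evaluate the infinite tail sums; the function names are chosen for this example only.

```python
import math


def tail_sum_inverse_squares(k):
    """Return sum_{j >= k} 1/j^2 for k >= 1, using sum_{j >= 1} 1/j^2 = pi^2/6."""
    return math.pi ** 2 / 6.0 - sum(1.0 / j ** 2 for j in range(1, k))


def condition_10_holds(k):
    """Check alpha_k >= sum_{j >= k} eps_j for alpha_k = 1/(k-1), eps_j = 1/j^2."""
    return 1.0 / (k - 1) >= tail_sum_inverse_squares(k)


# Relation (10) for the sequences of Remark 1, checked for k = 2, ..., 1000.
all_ok = all(condition_10_holds(k) for k in range(2, 1001))
```

This is only a sanity check over finitely many indices; the integral comparison in Remark 1 is what establishes (10) for all $k > 1$.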
2.2 ISA with dynamic step sizes 

In order to apply the dynamic step size rule (2), we need several modifications of the basic method, yielding Algorithm 2. This algorithm works with an estimate $\varphi$ of the optimal objective function value $f^*$ and essentially tries to reach a feasible point $x^k$ with $f(x^k) \le \varphi$. (Note that if $\varphi = f^*$, we would have obtained an optimal point in this case.)

Remark 2. A few comments on Algorithm 2 are in order:

1. Since $0 < \gamma < 1$, $\gamma^\ell \to 0$ (strictly monotonically) for $\ell \to \infty$. Thus, Steps 3-7 constitute a projection accuracy refinement phase, i.e., an inner loop in which the current $k$ is temporarily fixed, and $x^k$ is recomputed with a stricter accuracy setting for the adaptive projection. This phase either leads to a point showing $\varphi \ge f^*$ (by termination or convergence in the inner loop over $\ell$) or eventually resets $x^k$ to a point with $f_k > \varphi$ and $h^k \ne 0$ so that the regular (outer) iteration is resumed (with $k$ no longer fixed).
Algorithm 2 Dynamic Step Size ISA

Input: estimate $\varphi$ of $f^*$, starting point $x^0$, sequences $(\lambda_k)$, $(\varepsilon_k)$, parameter $\gamma \in (0, 1)$
Output: an (approximate) solution to (1)

1: initialize $k := 0$, $\ell := -1$, $x^{-1} := x^0$, $h^{-1} := 0$, $\alpha_{-1} := 0$, $\varepsilon_{-1} := \varepsilon_0$
2: repeat
3:   choose a subgradient $h^k \in \partial f(x^k)$ of $f$ at $x^k$
4:   if $f_k \le \varphi$ or $h^k = 0$ then
5:     if $x^k \in X$ then
6:       stop (at feasible point $x^k$ showing $\varphi \ge f^*$; optimal if $h^k = 0$)
7:     increment $\ell := \ell + 1$, reset $x^k := P_X^{\varepsilon}(x^{k-1} - \alpha_{k-1} h^{k-1})$ for $\varepsilon = \gamma^{\ell} \varepsilon_{k-1}$
8:     go to Step 3
9:   compute step size $\alpha_k := \lambda_k (f_k - \varphi) / \|h^k\|^2$
10:  compute the next iterate $x^{k+1} := P_X^{\varepsilon_k}(x^k - \alpha_k h^k)$
11:  reset $\ell := 0$ and increment $k := k + 1$
12: until a stopping criterion is satisfied



2. Note that, if $x^0$ is such that $f_0 \le \varphi$ or $h^0 = 0$, the algorithm begins with such a refinement phase, projecting more and more accurately until neither case holds any longer (if possible); the initializations with counter $-1$ are needed for this eventuality. Moreover, we could clearly postpone the (repeated) determination of a subgradient (Step 3) in a refinement phase until $f_k > \varphi$ is achieved, i.e., $h^k = 0$ would be the only reason for another accuracy refinement. This may be important in practice, where finding a subgradient sometimes is expensive itself, and the case $h^k = 0$ presumably occurs very rarely anyway. For the sake of brevity we did not treat this explicitly in Algorithm 2.

3. There are various ways in which the accuracy refinement phase could be realized. Instead of $(\gamma^\ell)$ with constant $\gamma \in (0, 1)$, any (strictly) monotonically decreasing sequence $(\gamma_\ell)$ could be used. Since we will need $\varepsilon_k \to 0$ to achieve feasibility (in the limit) anyway, which implies that for all $k$ there always exists some $L > 0$ such that $\varepsilon_{k+L} < \varepsilon_k$, we could also use $\min\{\varepsilon_{k-1}, \varepsilon_{k-1+\ell}\}$ as the recalibrated accuracy. Moreover, we do not need to fix $k$, i.e., repeatedly replace $x^k$ by finer approximate projections, but could produce a finite series of identical iterates (each reset to the last one before the inner loop started) until the refinement phase is over. Similarly, we could use $\alpha_k = \max\{0, \lambda_k (f_k - \varphi) / \|h^k\|^2\}$ (and $\alpha_k = 0$ if $h^k = 0$); letting $\varepsilon_k \to 0$ then naturally implements the refinement, while in iterations with $\alpha_k = 0$, the produced point may move up to $\varepsilon_k$ away from the optimal set. Assuming $(\varepsilon_k)$ is summable, this does not impede convergence. For all these variants, analogues to the following convergence results hold true as well; however, the proofs require some extensions to account for the technical differences to the variant we chose to present, which admitted the overall shortest proofs. In practice, we would generally expect these variants to behave similarly. Furthermore, note that in principle, the "problematic" cases could also be treated by reverting to exact projections; however, in our present context this should be avoided since computing the exact projection is considered too expensive.

We obtain the following convergence results, depending on whether $\varphi$ over- or underestimates $f^*$. The proofs are deferred to the next section.

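Theorem 2 (Convergence for dynamic step sizes with overestimation). Let the optimal point set $X^*$ be bounded, $\varphi \ge f^*$, $0 < \lambda_k \le \beta < 2$ for all $k$, and $\sum_{k=0}^{\infty} \lambda_k = \infty$. Let $(\nu_k)$ be a nonnegative sequence with $\sum_{k=0}^{\infty} \nu_k < \infty$, and let

[equation (11), defining $\bar{\varepsilon}_k$, is illegible in the source] (11)

If the subgradients $h^k$ satisfy $0 < \underline{H} \le \|h^k\| \le \overline{H} < \infty$ and $(\varepsilon_k)$ satisfies $0 \le \varepsilon_k \le \min\{\bar{\varepsilon}_k, \nu_k\}$ for all $k$, then the following holds.

(i) For any given $\delta > 0$ there exists some index $K$ such that $f(x^K) \le \varphi + \delta$.

(ii) If additionally $f(x^k) > \varphi$ for all $k$ and if $\lambda_k \to 0$, then $f_k \to \varphi$ for $k \to \infty$.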

Remark 3.

1. The sequence $(\nu_k)$ is a technicality needed in the proof to ensure $\varepsilon_k \to 0$. Note from (11) that $\bar{\varepsilon}_k > 0$ as long as ISA keeps iterating (in the main loop over $k$), since $f_k > \varphi$ is then guaranteed by the adaptive accuracy refinements and $0 < \lambda_k < 2$ holds by assumption.

2. More precisely, part (i) of Theorem 2 essentially means that after a finite number of iterations, we reach a point $x^k$ with $f^* - c \le f(x^k) \le \varphi + \delta$ for any $c > 0$. If $\varphi < f(x^k) \le \varphi + \delta$, this point may still be infeasible, but the closer $f(x^k)$ gets to $\varphi$, the smaller $\varepsilon_k$ becomes, i.e., the algorithm automatically increases the projection accuracy. On the other hand, termination in Step 6 implies that $f(x^k) \ge f^*$ (since $x^k$ is then feasible), and if some inner loop is infinite, then the refined projection points converge to a feasible point. Hence, for every $c > 0$, there is some integer $0 \le L < \infty$ such that after the $L$-th accuracy refinement and replacement of $x^k$, $f(x^k) \ge f^* - c$.

3. Part (ii) shows what happens when all function values $f(x^k)$ stay above the overestimate $\varphi$ of $f^*$ (which particularly holds true after possible refinements, if all the accuracy refinement phases are finite and no termination occurs) and we impose $\lambda_k \to 0$ for $k \to \infty$: We eventually obtain $f(x^k)$ arbitrarily close to $\varphi$, with vanishing feasibility violation as $k \to \infty$. Then, as well as in case of termination in Step 6 or convergence in a refinement phase ($\ell \to \infty$), it may be desirable to restart the algorithm using a smaller $\varphi$; see Section 4.2.

4. The conditions $\|h^k\| \ge \underline{H} > 0$, for all $k$, in Theorem 2 imply that all subgradients used by the algorithm are nonzero. These conditions are often automatically guaranteed, for example, if $X$ is compact and no unconstrained optimum of $f$ lies in $X$. In this case, $\|h\| \ge \underline{H} > 0$ for all $h \in \partial f(x)$ and $x \in X$. Moreover, the same holds for a small enough open neighborhood of $X$. Also, the norms of the subgradients are bounded from above. Thus, if we start close enough to $X$ and restrict $\varepsilon_k$ to be small enough, the conditions of Theorem 2 are fulfilled. Another example in which the conditions are satisfied appears in Section 5.2.

Theorem 3 (Convergence for dynamic step sizes with underestimation). Let the set of optimal points $X^*$ be bounded, $\varphi < f^*$, $0 < \lambda_k \le \beta < 2$ for all $k$, and $\sum_{k=0}^{\infty} \lambda_k = \infty$. Let $(\nu_k)$ be a nonnegative sequence with $\sum_{k=0}^{\infty} \nu_k < \infty$, let

[equation (12), defining $L_k$, is illegible in the source] (12)

and let

[equation (13), defining $\bar{\varepsilon}_k$, is illegible in the source] (13)

If the subgradients $h^k$ satisfy $0 < \underline{H} \le \|h^k\| \le \overline{H} < \infty$ and $(\varepsilon_k)$ satisfies $0 \le \varepsilon_k \le \min\{|\bar{\varepsilon}_k|, \nu_k\}$ for all $k$, then the following holds.

(i) For any given $\delta > 0$, there exists some $K$ such that $f_K \le f^* + \frac{\beta}{2 - \beta}(f^* - \varphi) + \delta$.

(ii) If additionally $\lambda_k \to 0$, then the sequence of objective function values $(f_k)$ of the ISA iterates $(x^k)$ converges to the optimal value $f^*$.



Remark 4.

1. If $f(x^k) \le \varphi < f^*$, Steps 3-7 ensure that after a finite number of projection refinements $x^k$ satisfies $\varphi < f(x^k)$. Thus, the algorithm will never terminate with Step 6 and every refinement phase is finite.

2. Moreover, infeasible points $x^k$ with $\varphi < f(x^k) < f^*$ are possible. Hence, the inequality in Theorem 3 (i) may be satisfied too soon to provide conclusive information regarding solution quality. Interestingly, part (ii) shows that by letting the parameters $(\lambda_k)$ tend to zero, one can nevertheless establish convergence to the optimal value $f^*$ (and $d_X(x^k) \le d_{X^*}(x^k) \to 0$, i.e., asymptotic feasibility).

3. Theoretically, small values of $\beta$ yield smaller errors, while in practice this restricts the method to very small steps (since $\lambda_k \le \beta$), resulting in slow convergence. This illustrates a typical kind of trade-off between solution accuracy and speed.

4. The use of $|\bar{\varepsilon}_k|$ in Theorem 3 avoids conflicting bounds on $\varepsilon_k$ in case $L_k > 0$. Because $0 \le \varepsilon_k \le \nu_k$ holds notwithstanding, $\varepsilon_k \to 0$ is maintained.

5. The same statements on lower and upper bounds on $\|h^k\|$ as in Remark 3 apply in the context of Theorem 3.

3 Convergence of ISA

From now on, let $(x^k)$ denote the sequence of points with corresponding objective function values $(f_k)$ and subgradients $(h^k)$, $h^k \in \partial f(x^k)$, as generated by ISA in the respective variant under consideration.

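Let us consider some basic inequalities which will be essential in establishing our main results. The exact Euclidean projection is nonexpansive, therefore

$$\|P_X(y) - x\| \le \|y - x\| \qquad \forall\, x \in X. \tag{14}$$

Hence, for the adaptive approximate projection $P_X^{\varepsilon}$ we have, by (7) and (14), for all $x \in X$

$$\begin{aligned} \|P_X^{\varepsilon}(y) - x\| &= \|P_X^{\varepsilon}(y) - P_X(y) + P_X(y) - x\| \\ &\le \|P_X^{\varepsilon}(y) - P_X(y)\| + \|P_X(y) - x\| \le \varepsilon + \|y - x\|. \end{aligned} \tag{15}$$

At some iteration $k$, let $x^{k+1}$ be produced by ISA using some step size $\alpha_k$ and write $y^k := x^k - \alpha_k h^k$. We thus obtain for every $x \in X$:

$$\begin{aligned} \|x^{k+1} - x\|^2 &= \|P_X^{\varepsilon_k}(y^k) - x\|^2 \\ &\le (\|y^k - x\| + \varepsilon_k)^2 = \|y^k - x\|^2 + 2\|y^k - x\|\,\varepsilon_k + \varepsilon_k^2 \\ &= \|x^k - x\|^2 - 2\alpha_k (h^k)^\top (x^k - x) + \alpha_k^2 \|h^k\|^2 + 2\|y^k - x\|\,\varepsilon_k + \varepsilon_k^2 \\ &\le \|x^k - x\|^2 - 2\alpha_k (f_k - f(x)) + \alpha_k^2 \|h^k\|^2 + 2\|x^k - x\|\,\varepsilon_k + 2\alpha_k \varepsilon_k \|h^k\| + \varepsilon_k^2 \\ &= \|x^k - x\|^2 - 2\alpha_k (f_k - f(x)) + (\alpha_k \|h^k\| + \varepsilon_k)^2 + 2\|x^k - x\|\,\varepsilon_k, \end{aligned} \tag{16}$$

where the second inequality follows from the subgradient definition (3) and the triangle inequality. Note that the above inequalities (14)-(16) hold in particular for every optimal point $x^* \in X^*$.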
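Inequality (15) can be sanity-checked numerically on a simple set. The sketch below assumes $X$ is the Euclidean unit ball in $\mathbb{R}^2$ (chosen only because its exact projection is explicit) and simulates $P_X^{\varepsilon}$ by moving the exact projection a distance $\varepsilon$ in an arbitrary direction. The function names and the small rounding tolerance `tol` are illustrative assumptions.

```python
import math


def project_unit_ball(y):
    """Exact Euclidean projection onto the unit ball in R^2."""
    n = math.hypot(y[0], y[1])
    return list(y) if n <= 1.0 else [y[0] / n, y[1] / n]


def approx_project(y, eps, angle):
    """Simulated P_X^eps: the exact projection moved by eps in direction `angle`."""
    p = project_unit_ball(y)
    return [p[0] + eps * math.cos(angle), p[1] + eps * math.sin(angle)]


def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def check_inequality_15(y, x, eps, angle, tol=1e-12):
    """Check ||P_X^eps(y) - x|| <= eps + ||y - x|| for a point x in X."""
    return dist(approx_project(y, eps, angle), x) <= eps + dist(y, x) + tol
```

The check holds for every direction of the perturbation, which reflects that (15) uses only the distance bound (7) and no information about where the approximate projection lies relative to $X$.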

3.1 ISA with predetermined step size sequence 

The proof of the convergence of the ISA iterates $x^k$ is somewhat more involved than for the classical subgradient method as, e.g., in [49]. This is due to the additional error terms caused by the adaptive approximate projection and the fact that $f_k \ge f^*$ is not guaranteed since the iterates may be infeasible.

Proof of Theorem 1. We rewrite the estimate (16) with $x = x^* \in X^*$ as

$$\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2 - 2\alpha_k (f_k - f^*) + \underbrace{(\alpha_k \|h^k\| + \varepsilon_k)^2 + 2\|x^k - x^*\|\,\varepsilon_k}_{=:\, \beta_k} \tag{17}$$

and obtain (by applying (17) for $k = 0, \dots, m$)

$$\|x^{m+1} - x^*\|^2 \le \|x^0 - x^*\|^2 - 2\sum_{k=0}^{m} (f_k - f^*)\,\alpha_k + \sum_{k=0}^{m} \beta_k.$$

Our first goal is to show that $\sum_k \beta_k$ is a convergent series. Using $\|h^k\| \le H$ and denoting $A := \sum_{k=0}^{\infty} \alpha_k^2$, we get

$$\sum_{k=0}^{m} \beta_k \le A H^2 + \sum_{k=0}^{m} \varepsilon_k^2 + 2H \sum_{k=0}^{m} \alpha_k \varepsilon_k + 2 \sum_{k=0}^{m} \|x^k - x^*\|\,\varepsilon_k.$$

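Now denote $D := \|x^0 - x^*\|$ and consider the last term (without the factor 2):

$$\begin{aligned} \sum_{k=0}^{m} \|x^k - x^*\|\,\varepsilon_k &= D\varepsilon_0 + \sum_{k=1}^{m} \|P_X^{\varepsilon_{k-1}}(x^{k-1} - \alpha_{k-1} h^{k-1}) - x^*\|\,\varepsilon_k \\ &\le D\varepsilon_0 + \sum_{k=1}^{m} \|P_X^{\varepsilon_{k-1}}(x^{k-1} - \alpha_{k-1} h^{k-1}) - P_X(x^{k-1} - \alpha_{k-1} h^{k-1})\|\,\varepsilon_k \\ &\qquad + \sum_{k=1}^{m} \|P_X(x^{k-1} - \alpha_{k-1} h^{k-1}) - x^*\|\,\varepsilon_k \\ &\le D\varepsilon_0 + \sum_{k=1}^{m} \varepsilon_{k-1}\varepsilon_k + \sum_{k=1}^{m} \|x^{k-1} - \alpha_{k-1} h^{k-1} - x^*\|\,\varepsilon_k \\ &\le D\varepsilon_0 + \sum_{k=0}^{m-1} \varepsilon_k \varepsilon_{k+1} + \sum_{k=0}^{m-1} \|x^k - x^*\|\,\varepsilon_{k+1} + \sum_{k=0}^{m-1} \alpha_k \|h^k\|\,\varepsilon_{k+1} \\ &\le D(\varepsilon_0 + \varepsilon_1) + \sum_{k=0}^{m-1} \varepsilon_k \varepsilon_{k+1} + \sum_{k=1}^{m-1} \|x^k - x^*\|\,\varepsilon_{k+1} + H \sum_{k=0}^{m-1} \alpha_k \varepsilon_{k+1}. \end{aligned} \tag{18}$$

Repeating this procedure to eliminate all terms $\|x^k - x^*\|$ for $k > 0$, we obtain

$$\begin{aligned} (18) \le \dots &\le D \sum_{k=0}^{m} \varepsilon_k + \sum_{j=1}^{m} \left( \sum_{k=0}^{m-j} \varepsilon_k \varepsilon_{k+j} + H \sum_{k=0}^{m-j} \alpha_k \varepsilon_{k+j} \right) \\ &= D \sum_{k=0}^{m} \varepsilon_k + \sum_{j=1}^{m} \sum_{k=0}^{m-j} (\varepsilon_k + H\alpha_k)\,\varepsilon_{k+j}. \end{aligned} \tag{19}$$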

fe=0 j=l fe=0 

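Using the above chain of inequalities, (8) and (10), and the abbreviation $E := \sum_{k=0}^{\infty} \varepsilon_k$, we finally get:

$$\begin{aligned} \|x^{m+1} - x^*\|^2 + 2\sum_{k=0}^{m} (f_k - f^*)\,\alpha_k &\le D^2 + \sum_{k=0}^{m} \beta_k \\ &\le D^2 + AH^2 + \sum_{k=0}^{m} \varepsilon_k^2 + 2H \sum_{k=0}^{m} \alpha_k \varepsilon_k + 2D \sum_{k=0}^{m} \varepsilon_k + 2 \sum_{j=1}^{m} \sum_{k=0}^{m-j} (\varepsilon_k + H\alpha_k)\,\varepsilon_{k+j} \\ &\le D^2 + AH^2 + 2D \sum_{k=0}^{m} \varepsilon_k + 2 \sum_{j=0}^{m} \sum_{k=0}^{m-j} \varepsilon_k \varepsilon_{k+j} + 2H \sum_{j=0}^{m} \sum_{k=0}^{m-j} \alpha_k \varepsilon_{k+j} \\ &= D^2 + AH^2 + 2D \sum_{k=0}^{m} \varepsilon_k + 2 \sum_{j=0}^{m} \Big( \varepsilon_j \sum_{k=j}^{m} \varepsilon_k \Big) + 2H \sum_{j=0}^{m} \Big( \alpha_j \sum_{k=j}^{m} \varepsilon_k \Big) \\ &\le D^2 + AH^2 + 2D \sum_{k=0}^{m} \varepsilon_k + 2 \sum_{j=0}^{m} E\,\varepsilon_j + 2H \sum_{j=0}^{m} \alpha_j \alpha_j \\ &\le D^2 + AH^2 + 2(D + E) \sum_{k=0}^{m} \varepsilon_k + 2H \sum_{k=0}^{m} \alpha_k^2 \\ &\le (D + E)^2 + E^2 + (2 + H) A H =: R < \infty. \end{aligned} \tag{20}$$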



Since the iterates $x^k$ may be infeasible, possibly $f_k < f^*$, and hence the second term on the left hand side of (20) might be negative. Therefore, we distinguish two cases:

i) If $f_k \ge f^*$ for all but finitely many $k$, we can assume without loss of generality that $f_k \ge f^*$ for all $k$ (by considering only the "later" iterates). Now, because $f_k \ge f^*$ for all $k$,

$$\sum_{k=0}^{m} (f_k - f^*)\,\alpha_k \ge \sum_{k=0}^{m} \Big( \underbrace{\min_{j=0,\dots,m} f_j}_{=:\, f_m^*} - f^* \Big) \alpha_k = (f_m^* - f^*) \sum_{k=0}^{m} \alpha_k.$$

Together with (20) this yields

$$0 \le 2 (f_m^* - f^*) \sum_{k=0}^{m} \alpha_k \le R \quad \Longleftrightarrow \quad 0 \le f_m^* - f^* \le \frac{R}{2 \sum_{k=0}^{m} \alpha_k}.$$

Thus, because $\sum_{k=0}^{\infty} \alpha_k$ diverges, we have $f_m^* \to f^*$ for $m \to \infty$ (and, in particular, $\liminf_{k \to \infty} f_k = f^*$).

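To show that $f^*$ is in fact the only possible accumulation point (and hence the limit) of $(f_k)$, assume that $(f_k)$ has another accumulation point strictly larger than $f^*$, say $f^* + \eta$ for some $\eta > 0$. Then, both cases $f_k < f^* + \frac{1}{3}\eta$ and $f_k > f^* + \frac{2}{3}\eta$ must occur infinitely often. We can therefore define two index subsequences $(m_\ell)$ and $(n_\ell)$ by setting $n_{(-1)} := -1$ and, for $\ell \ge 0$,

$$m_\ell := \min\{\, k \mid k > n_{\ell-1},\ f_k > f^* + \tfrac{2}{3}\eta \,\},$$
$$n_\ell := \min\{\, k \mid k > m_\ell,\ f_k < f^* + \tfrac{1}{3}\eta \,\}.$$

Figure 2 illustrates this choice of indices. Now observe that for any $\ell$,

$$\begin{aligned} \tfrac{1}{3}\eta < f_{m_\ell} - f_{n_\ell} \le H \cdot \|x^{n_\ell} - x^{m_\ell}\| &\le H \big( \|x^{n_\ell - 1} - x^{m_\ell}\| + H\alpha_{n_\ell - 1} + \varepsilon_{n_\ell - 1} \big) \\ &\le \dots \le H^2 \sum_{j=m_\ell}^{n_\ell - 1} \alpha_j + H \sum_{j=m_\ell}^{n_\ell - 1} \varepsilon_j, \end{aligned} \tag{21}$$

where the second inequality is obtained similar to (18).

Fig. 2. The sequences $(m_\ell)$ and $(n_\ell)$. (Plot of the objective values over the iteration $k$, with the levels $f^* + \eta$, $f^* + \frac{2}{3}\eta$, and $f^* + \frac{1}{3}\eta$ and the indices $m_0, n_0, m_1, n_1, m_2, n_2$ marked.)

For a given $m$, let $\ell_m := \max\{\, \ell \mid n_\ell - 1 \le m \,\}$ be the number of blocks of indices between two consecutive indices $m_\ell$ and $n_\ell - 1$ until $m$. We obtain:

$$\tfrac{1}{3} \sum_{\ell=0}^{\ell_m} \eta \le H^2 \sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell - 1} \alpha_j + H \sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell - 1} \varepsilon_j \le H^2 \sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell - 1} \alpha_j + HE. \tag{22}$$

For $m \to \infty$, the left hand side tends to infinity, and since $HE < \infty$, this implies that

$$\sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell - 1} \alpha_j \to \infty.$$

Then, since $\alpha_k > 0$ and $f_k \ge f^*$ for all $k$, (20) yields

$$\begin{aligned} \infty > R &\ge \|x^{m+1} - x^*\|^2 + 2 \sum_{k=0}^{m} (f_k - f^*)\,\alpha_k \ge 2 \sum_{k=0}^{m} (f_k - f^*)\,\alpha_k \\ &\ge 2 \sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell - 1} (f_j - f^*)\,\alpha_j \ge \tfrac{2}{3}\eta \sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell - 1} \alpha_j. \end{aligned}$$

But for $m \to \infty$, this yields a contradiction since the sum on the right hand side diverges. Hence, there does not exist an accumulation point strictly larger than $f^*$, so we can conclude $f_k \to f^*$ as $k \to \infty$, i.e., the whole sequence $(f_k)$ converges to $f^*$.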

We now consider convergence of the sequence $(x^k)$. From (20) we conclude that both terms on the left hand side are bounded independently of $m$. In particular this means $(x^k)$ is a bounded sequence. Hence, by the Bolzano-Weierstraß Theorem, it has a convergent subsequence $(x^{k_i})$ with $x^{k_i} \to \bar{x}$ (as $i \to \infty$) for some $\bar{x}$. To show that the full sequence $(x^k)$ converges to $\bar{x}$, take any $K$ and any $k_i < K$ and observe from (17) that

$$\|x^K - \bar{x}\|^2 \le \|x^{k_i} - \bar{x}\|^2 + \sum_{j=k_i}^{K-1} \beta_j.$$

Since $\sum_k \beta_k$ is a convergent series (as seen from the second last line of (20)), the right hand side becomes arbitrarily small for $k_i$ and $K$ large enough. This implies $x^k \to \bar{x}$, and since $\varepsilon_k \to 0$, $f_k \to f^*$, and $X^*$ is closed, $\bar{x} \in X^*$ must hold.

ii) Now consider the case where $f_k < f^*$ occurs infinitely often. We write $(f_k^-)$
for the subsequence of $(f_k)$ with $f_k < f^*$ and $(f_k^+)$ for the subsequence
with $f_k \ge f^*$. Clearly $f_k^- \to f^*$. Indeed, the corresponding iterates are
asymptotically feasible (since the projection accuracy $\varepsilon_k$ tends to zero), and
hence $f^*$ is the only possible accumulation point of $(f_k^-)$.

Denoting $M^- = \{\, k \le m \mid f_k < f^* \,\}$ and $M^+ = \{\, k \le m \mid f_k \ge f^* \,\}$, we
conclude from (20) that

$\|x^{m+1} - x^*\|^2 + 2\sum_{k \in M^+} (f_k - f^*)\,\alpha_k \ \le\ R + 2\sum_{k \in M^-} (f^* - f_k)\,\alpha_k.$ (23)


Note that each summand is non-negative. To see that the right hand side is
bounded independently of $m$, let $y^{k-1} = x^{k-1} - \alpha_{k-1} h^{k-1}$, and observe that
here ($k \in M^-$), due to $f_k < f^* \le f(\mathcal{P}_X(y^{k-1}))$, we have

r-fk< fiVxiy"-')) - fivT'iv"-')) 

< {h'-y{rxiy''-')-rT'{y''-')) 

< Wh'-'W ■ \\Vx{y'-') - V'^'-'{y''-^)\\ < Hsk-u 

using the subgradient and Cauchy-Schwarz inequalities as well as property (7)
of $\mathcal{P}_X^{\varepsilon}$ and the boundedness of the subgradient norms. From (23), using (9)
and (10), we thus obtain

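$\|x^{m+1} - x^*\|^2 + 2\sum_{k \in M^+} (f_k - f^*)\,\alpha_k \ \le\ R + 2H \sum_{k \in M^-} \alpha_k\,\varepsilon_{k-1}$

$\qquad \le\ R + 2H \sum_{k \in M^-} \alpha_k\,\alpha_{k-1} \ \le\ R + 2H \sum_{k=0}^{\infty} \alpha_k\,\alpha_{k-1} \ \le\ R + 4AH \ <\ \infty.$ (24)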

Similar to case i), we conclude that both the sequence $(x^k)$ and the series
$\sum_{k \in M^+} (f_k - f^*)\,\alpha_k$ are bounded.

It remains to show that $f_k^+ \to f^*$. Assume to the contrary that $(f_k^+)$ has
an accumulation point $f^* + \eta$ for $\eta > 0$. Similar to before, we construct index
subsequences $(m_\ell)$ and $(p_\ell)$ as follows: Set $p_{-1} := -1$ and define, for $\ell \ge 0$,

$m_\ell := \min\{\, k \in M^+ \mid k > p_{\ell-1},\ f_k > f^* + \tfrac{2}{3}\eta \,\}$,
$p_\ell := \min\{\, k \in M^- \mid k > m_\ell \,\}$.

Then $m_\ell, \dots, p_\ell - 1 \in M^+$ for all $\ell$, and we have

j=me j=me 

Therefore, with $\ell_m := \max\{\, \ell \mid p_\ell - 1 \le m \,\}$ for a given $m$,

e=0 i=0 j=mt 1=0 3=mt 1=0 j=me 

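Now the left hand side becomes arbitrarily large as $m \to \infty$, so that also
$\sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{p_\ell-1} \alpha_j \to \infty$, since $(\varepsilon_k)$ is summable.
Note that because $\alpha_k > 0$ and

$\sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{p_\ell-1} \alpha_j \ \le\ \sum_{k \in M^+} \alpha_k,$

this latter series must diverge as well. As a consequence, $f^*$ is itself an (other)
accumulation point of $(f_k^+)$. From (24) we have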

oo > i? + AAH >2 J2 ifk- rhk 

keM+ 

> J2 (mm{ fj\j€M+,j<m} -n ak 

4- ^ V ' 

and thus 

0<f^-f< ^ >0 asm^oo, 

since J2keM+ diverges. But then, knowing (/^) converges to /*, we can 
use {nii) and another index subsequence (n^), given by 

$n_\ell := \min\{\, k \in M^+ \mid k > m_\ell,\ f_k < f^* + \tfrac{1}{3}\eta \,\}$,

to proceed analogously to case i) to arrive at a contradiction and conclude
that no $\eta > 0$ exists such that $f^* + \eta$ is an accumulation point of $(f_k^+)$.

On the other hand, since $(x^k)$ is bounded and $f$ is continuous on a neighborhood
of $X$ (recall that for all $k$, $x^k$ is contained in an $\varepsilon_k$-neighborhood
of $X$), $(f_k^+)$ is bounded. Thus, it must have at least one accumulation point.
Since $f_k \ge f^*$ for all $k \in M^+$, the only possibility left is $f^*$ itself. Hence, $f^*$
is the unique accumulation point (i.e., the limit) of the sequence $(f_k^+)$. As
this is also true for $(f_k^-)$, the whole sequence $(f_k)$ converges to $f^*$.

Finally, convergence of the bounded sequence $(x^k)$ to some $\bar{x} \in X^*$ can
now be obtained just like in case i), completing the proof. □



= (/m - /*) E 

keM^ 
3.2 ISA with dynamic Polyak-type step sizes

Let us now turn to dynamic step sizes. In the rest of this section, $\alpha_k$ will always
denote step sizes of the form (2).

Since in subgradient methods the objective function values need not decrease 
monotonically, the key quantity in convergence proofs usually is the distance to 
the optimal set X* . For ISA with dynamic step sizes (Algorithm 2), we have the 
following result concerning these distances: 

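Lemma 1. Let $x^* \in X^*$. For the sequence of ISA iterates $(x^k)$, computed with
step sizes $\alpha_k = \lambda_k (f_k - \varphi)/\|h^k\|^2$, it holds that

$\|x^{k+1} - x^*\|^2 \ \le\ \|x^k - x^*\|^2 + \varepsilon_k^2 + 2\Big( \frac{\lambda_k (f_k - \varphi)}{\|h^k\|} + \|x^k - x^*\| \Big)\varepsilon_k$

$\qquad\qquad +\ \frac{\lambda_k (f_k - \varphi)}{\|h^k\|^2} \big( \lambda_k (f_k - \varphi) - 2 (f_k - f^*) \big).$ (25)

In particular, also

$d_{X^*}(x^{k+1})^2 \ \le\ d_{X^*}(x^k)^2 - 2\alpha_k (f_k - f^*) + (\alpha_k \|h^k\| + \varepsilon_k)^2 + 2\, d_{X^*}(x^k)\, \varepsilon_k.$ (26)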

Proof. Plug (2) into (16) for $x = x^*$ and rearrange terms to obtain (25). If the
optimization problem (1) has a unique optimum $x^*$, then obviously
$\|x^k - x^*\| = d_{X^*}(x^k)$ for all $k$, so (26) is identical to (25). Otherwise, note
that since $X^*$ is the intersection of the closed set $X$ with the level set
$\{\, x \mid f(x) = f^* \,\}$ of the convex function $f$, $X^*$ is closed (cf., for example,
[26, Prop. 1.2.2, 1.2.6]) and the projection onto $X^*$ is well-defined. Then,
considering $x^* = \mathcal{P}_{X^*}(x^k)$, (16) becomes

$\|x^{k+1} - \mathcal{P}_{X^*}(x^k)\|^2 \ \le\ d_{X^*}(x^k)^2 - 2\alpha_k (f_k - f^*) + (\alpha_k \|h^k\| + \varepsilon_k)^2 + 2\, d_{X^*}(x^k)\, \varepsilon_k.$

Furthermore, because obviously $f(\mathcal{P}_{X^*}(x)) = f(\mathcal{P}_{X^*}(y)) = f^*$ for all
$x, y \in \mathbb{R}^n$, and by definition of the Euclidean projection,

$d_{X^*}(x^{k+1})^2 = \|x^{k+1} - \mathcal{P}_{X^*}(x^{k+1})\|^2 \ \le\ \|x^{k+1} - \mathcal{P}_{X^*}(x^k)\|^2.$

Combining the last two inequalities yields (26).

Moreover, note that these results continue to hold true if $x^{k+1}$ is replaced in
a projection refinement phase (starting in the next iteration $k + 1$), since then
only accuracy parameters smaller than $\varepsilon_k$ are used. □

Typical convergence results are often derived by showing that the sequence
$(\|x^k - x^*\|)$ is monotonically decreasing (for arbitrary $x^* \in X^*$) under certain
assumptions on the step sizes, subgradients, etc. This is also done in [2], where (25)
with $\varepsilon_k = 0$ for all $k$ is the central inequality, cf. [2, Prop. 2]. In our case, i.e.,
working with adaptive approximate projections as specified by (7), we can follow
this principle to derive conditions on the projection accuracies $(\varepsilon_k)$ which still
allow for a (monotonic) decrease of the distances from the optimal set: If the
last summand in (25) is negative, the resulting gap between the distances from
$X^*$ of subsequent iterates can be exploited to relax the projection accuracy, i.e.,
to choose $\varepsilon_k > 0$ without destroying monotonicity.

Naturally, to achieve feasibility (at least in the limit), we will need to have
$(\varepsilon_k)$ diminishing ($\varepsilon_k \to 0$ as $k \to \infty$). It will become clear that this,
combined with summability ($\sum_{k=0}^{\infty} \varepsilon_k < \infty$) and with monotonicity
conditions as described above, is already enough to extend the analysis to cover
iterations with $f_k < f^*$, which may occur since we project inaccurately.
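
As a minimal sketch of how the dynamic step size (2) enters a single ISA iteration, the following Python snippet computes $\alpha_k = \lambda_k (f_k - \varphi)/\|h^k\|^2$ and then applies an approximate projection. The function names, the callable `approx_proj(y, eps)` (assumed to return a point within distance `eps` of the exact projection, in the spirit of (7)), and the toy problem are illustrative assumptions and not part of the method's specification; the refinement loops of Algorithm 2 are deliberately omitted.

```python
import numpy as np

def polyak_step(f_k, phi, h_k, lam_k):
    """Dynamic step size alpha_k = lam_k * (f_k - phi) / ||h_k||^2, cf. (2)."""
    norm_sq = float(np.dot(h_k, h_k))
    if norm_sq == 0.0:
        raise ValueError("zero subgradient: step size (2) is undefined")
    return lam_k * (f_k - phi) / norm_sq

def isa_iteration(x, f, subgrad, approx_proj, phi, lam_k, eps_k):
    """One simplified ISA step: subgradient move, then inexact projection."""
    h = subgrad(x)
    alpha = polyak_step(f(x), phi, h, lam_k)
    return approx_proj(x - alpha * h, eps_k)

# Hypothetical toy problem: minimize ||x||_1 over the hyperplane {x : x_1 + x_2 = 2}.
a, b = np.array([1.0, 1.0]), 2.0

def exact_proj(y):
    """Exact Euclidean projection onto {x : a^T x = b}."""
    return y - (a @ y - b) / (a @ a) * a

def approx_proj(y, eps):
    """Deliberately inexact: shift the exact projection by eps along a/||a||."""
    return exact_proj(y) + eps * a / np.linalg.norm(a)

x0 = np.array([3.0, -1.0])
x1 = isa_iteration(x0, lambda x: np.abs(x).sum(), np.sign, approx_proj,
                   phi=2.0, lam_k=1.0, eps_k=1e-3)
```

In this toy run the new iterate `x1` is infeasible by construction, but it lies within `eps_k` of the exact projection, which is the only property of the projection step that the analysis relies on.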

For different choices of the estimate $\varphi$ of $f^*$, we will now derive the proofs of
Theorems 2 and 3 via a series of intermediate results. Corresponding results for
exact projections ($\varepsilon_k = 0$) can be found in [2]. In fact, our analysis for adaptive
approximate projections improves on some of these earlier results (e.g., [2, Prop.
10] states convergence of some subsequence of the function values to the optimum
for the case $\varphi < f^*$, whereas Theorem 3 in this paper gives convergence of the
whole sequence $(f_k)$, for approximate and also for exact projections).

For the remainder of this section we can assume that ISA does not terminate
in Step 6 and that all inner projection accuracy refinement loops are finite.
Otherwise, there is some refinement phase starting at iteration $k$ such that, as
$i \to \infty$, $x^k$ is repeatedly reset to

yi ■■= Vt'--\x^-' - ak-,h^-') ^ Vl{x^-' - ak-,h^-') € X, 
with f{yl) ^ < if; cf. Remarks 3 and 4. 

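Using overestimates of the optimal value. In this part we will focus on
the case $\varphi \ge f^*$. As might be expected, this relation allows for eliminating the
unknown $f^*$ from (26).

Lemma 2. Let $\varphi \ge f^*$ and $\lambda_k \ge 0$. If $f_k \ge \varphi$ for some $k \in \mathbb{N}$, then

$d_{X^*}(x^{k+1})^2 \ \le\ d_{X^*}(x^k)^2 + \varepsilon_k^2 + 2\Big( \frac{\lambda_k (f_k - \varphi)}{\|h^k\|} + d_{X^*}(x^k) \Big)\varepsilon_k + \frac{\lambda_k (\lambda_k - 2)(f_k - \varphi)^2}{\|h^k\|^2}.$ (27)

Proof. This follows immediately from Lemma 1, using $f_k \ge \varphi \ge f^*$ and $\lambda_k \ge 0$. □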

Note that ISA guarantees $f_k \ge \varphi$ by sufficiently accurate projection (otherwise
the method stops or the inner refinement loop over $i$, with fixed $k$, is
infinite, indicating $\varphi$ was too large, see Steps 3-7 of Algorithm 2), and that the
last summand in (27) is always negative for $0 < \lambda_k < 2$. Hence, adaptive
approximate projections ($\varepsilon_k > 0$) can always be employed without destroying the
monotonic decrease of $(d_{X^*}(x^k))$, as long as the $\varepsilon_k$ are chosen small enough.

The following result provides a theoretical bound on how large the projection
accuracies $\varepsilon_k$ may become.

Lemma 3. Let $0 < \lambda_k < 2$ for all $k$. For $\varphi \ge f^*$, the sequence $(d_{X^*}(x^k))$ is
monotonically decreasing and converges to some $\zeta \ge 0$, if $0 \le \varepsilon_k \le \bar{\varepsilon}_k$ for all $k$,
where $\bar{\varepsilon}_k$ is defined in (11) of Theorem 2.

Proof. Considering (27), it suffices to show that for $\varepsilon_k \le \bar{\varepsilon}_k$, we have

$\varepsilon_k^2 + 2\Big( \frac{\lambda_k (f_k - \varphi)}{\|h^k\|} + d_{X^*}(x^k) \Big)\varepsilon_k + \frac{\lambda_k (\lambda_k - 2)(f_k - \varphi)^2}{\|h^k\|^2} \ \le\ 0.$ (28)

The bound $\bar{\varepsilon}_k$ from (11) is precisely the (unique) positive root of the quadratic
function in $\varepsilon_k$ given by the left hand side of (28). Thus, we have a monotonically
decreasing (i.e., nonincreasing) sequence $(d_{X^*}(x^k))$, and since its members are
bounded below by zero, it converges to some nonnegative value, say $\zeta$. □

As a consequence, if $X^*$ is bounded, we obtain boundedness of the iterate
sequence $(x^k)$:

Corollary 1. Let $X^*$ be bounded. If the sequence $(d_{X^*}(x^k))$ is monotonically
decreasing, then the sequence $(x^k)$ is bounded.

Proof. By monotonicity of $(d_{X^*}(x^k))$, making use of the triangle inequality,

$\|x^k\| \ \le\ d_{X^*}(x^k) + \|\mathcal{P}_{X^*}(x^k)\| \ \le\ d_{X^*}(x^0) + \sup_{x \in X^*} \|x\| \ <\ \infty,$

since $X^*$ is bounded by assumption. □

We now have all the tools at hand for proving Theorem 2.

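Proof of Theorem 2. First, we prove part (i). Let the main assumptions of
Theorem 2 hold and suppose (contrary to the desired result (i)) that $f_k > \varphi + \delta$
for all $k$ (possibly after finitely many refinements of the projection accuracy used
to compute $x^k$). By Lemma 2,

$\frac{\lambda_k (2 - \lambda_k)(f_k - \varphi)^2}{\|h^k\|^2} \ \le\ d_{X^*}(x^k)^2 - d_{X^*}(x^{k+1})^2 + \varepsilon_k^2 + 2\Big( \frac{\lambda_k (f_k - \varphi)}{\|h^k\|} + d_{X^*}(x^k) \Big)\varepsilon_k.$

Since $0 < \underline{H} \le \|h^k\| \le \overline{H} < \infty$, $0 < \lambda_k \le \beta < 2$, and $f_k - \varphi > \delta$ for all $k$ by
assumption, we have

$\frac{\lambda_k (2 - \lambda_k)(f_k - \varphi)^2}{\|h^k\|^2} \ \ge\ \frac{\lambda_k (2 - \beta)\,\delta^2}{\overline{H}^2}.$

By Lemma 3, $d_{X^*}(x^k) \le d_{X^*}(x^0)$. Also, by Corollary 1 there exists $F < \infty$ such
that $f_k \le F$ for all $k$. Hence, $\lambda_k (f_k - \varphi) \le \beta (F - \varphi)$, and since
$1/\|h^k\| \le 1/\underline{H}$, we obtain

$\frac{(2 - \beta)\,\delta^2}{\overline{H}^2}\,\lambda_k \ \le\ d_{X^*}(x^k)^2 - d_{X^*}(x^{k+1})^2 + \varepsilon_k^2 + 2\Big( \frac{\beta (F - \varphi)}{\underline{H}} + d_{X^*}(x^0) \Big)\varepsilon_k.$ (29)

Summation of the inequalities (29) for $k = 0, 1, \dots, m$ yields

$\frac{(2 - \beta)\,\delta^2}{\overline{H}^2} \sum_{k=0}^{m} \lambda_k \ \le\ d_{X^*}(x^0)^2 - d_{X^*}(x^{m+1})^2 + \sum_{k=0}^{m} \varepsilon_k^2 + 2\Big( \frac{\beta (F - \varphi)}{\underline{H}} + d_{X^*}(x^0) \Big) \sum_{k=0}^{m} \varepsilon_k.$

Now, by assumption, the left hand side tends to infinity as $m \to \infty$, while the
right hand side remains finite (note that nonnegativity and summability of $(\nu_k)$
imply the summability of $(\nu_k^2)$, properties that carry over to $(\varepsilon_k)$). Thus, we have
reached a contradiction and therefore proven part (i) of Theorem 2, i.e., that
$f_K \le \varphi + \delta$ holds in some iteration $K$.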

We now turn to part (ii): Let the main assumptions of Theorem 2 hold, let
$\lambda_k \to 0$ and suppose $f_k > \varphi$ for all $k$ (again, possibly after refinements). Then,
since we know from part (i) that the function values fall below every $\varphi + \delta$, we can
construct a monotonically decreasing subsequence $(f_{K_j})$ such that $f_{K_j} \to \varphi$. (To
see this, note that if $f_k \le \varphi + \delta$ is reached with $f_k < \varphi$, the ensuing refinement
phase does not necessarily end with $x^k$ replaced by a point with $\varphi \le f_k \le \varphi + \delta$,
but that then, however, there always exists a $K > k$ such that $\varphi \le f_K \le \varphi + \delta$,
since $\lambda_k \to 0$, $\varepsilon_k \to 0$, and by continuity of $f$.)

To show that $\varphi$ is the unique accumulation point of $(f_k)$, assume to the
contrary that there is another subsequence of $(f_k)$ which converges to $\varphi + \eta$,
with some $\eta > 0$. We can now employ the same technique as in the proof of
Theorem 1 to reach a contradiction:

The two cases $f_k < \varphi + \tfrac{1}{3}\eta$ and $f_k > \varphi + \tfrac{2}{3}\eta$ must both occur infinitely often,
since $\varphi$ and $\varphi + \eta$ are accumulation points. Set $n_{-1} := -1$ and define, for $\ell \ge 0$,

$m_\ell := \min\{\, k \mid k > n_{\ell-1},\ f_k > \varphi + \tfrac{2}{3}\eta \,\}$,
$n_\ell := \min\{\, k \mid k > m_\ell,\ f_k < \varphi + \tfrac{1}{3}\eta \,\}$.

Then, with $\infty > F \ge f_k$ for all $k$ (existing since $(x^k)$ is bounded and therefore
so is $(f_k)$) and the subgradient norm bounds, we obtain

h < /„, - /„, < - x-^ II < E + ^ E 

j=me j=me 

and from this, denoting $\ell_m := \max\{\, \ell \mid n_\ell - 1 \le m \,\}$ for a given $m$,



e=0 — i=Oj=me t=Oj=me 

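Since for $m \to \infty$, the left hand side tends to infinity, the same must hold for
the right hand side. But since
$\sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell-1} \varepsilon_j \le \sum_{k=0}^{\infty} \varepsilon_k \le \sum_{k=0}^{\infty} \nu_k < \infty$, this
implies

$\sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell-1} \lambda_j \to \infty$ for $m \to \infty$. (30)
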
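Also, using the same estimates as in part (i) above, (27) yields

$C_1\,(f_k - \varphi)^2 \lambda_k \ \le\ d_{X^*}(x^k)^2 - d_{X^*}(x^{k+1})^2 + \varepsilon_k^2 + C_2\,\varepsilon_k,$

with $C_1 := \frac{2 - \beta}{\overline{H}^2} < \infty$ and $C_2 := 2\Big( \frac{\beta (F - \varphi)}{\underline{H}} + d_{X^*}(x^0) \Big) < \infty$,
and thus by summation for $k = 0, \dots, m$ for a given $m$,

$C_1 \sum_{k=0}^{m} (f_k - \varphi)^2 \lambda_k \ \le\ d_{X^*}(x^0)^2 - d_{X^*}(x^{m+1})^2 + \sum_{k=0}^{m} \varepsilon_k^2 + C_2 \sum_{k=0}^{m} \varepsilon_k.$ (31)

Observe that all summands of the left hand side term are positive, and thus

$\sum_{k=0}^{m} (f_k - \varphi)^2 \lambda_k \ \ge\ \sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell-1} (f_j - \varphi)^2 \lambda_j \ \ge\ \tfrac{1}{9}\eta^2 \sum_{\ell=0}^{\ell_m} \sum_{j=m_\ell}^{n_\ell-1} \lambda_j.$

Therefore, as $m \to \infty$, the left hand side of (31) tends to infinity (by (30) and
the above inequality) while the right hand side expression remains finite (recall
$0 \le \varepsilon_k \le \nu_k$ with $(\nu_k)$ summable and thus also square-summable). Thus, we
have reached a contradiction, and it follows that $\varphi$ is the only accumulation
point (i.e., the limit) of the whole sequence $(f_k)$.

This proves part (ii) and thus completes the proof of Theorem 2. □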

Remark 5. With more technical effort one can argue along the lines of the proof 
of Theorem 1 to obtain the following result on the convergence of the iterates x'' 
in the case of Theorem 3: If we additionally assume that < oo and that 

Afc > '^'jLk f"-*^ ^' then x'^ x for some x G X with f{x) = tp and 
dx' {x) = ( >0 {( being the same as in Lemma 3). 



Using lower bounds on the optimal value. In the following, we focus on
the case $\varphi < f^*$, i.e., using a constant lower bound in the step size definition (2).
Such a lower bound is often more readily available than (useful) upper bounds;
for instance, it can be computed via the dual problem, or sometimes derived
directly from properties of the objective function such as, e.g., nonnegativity of
the function values.

Following arguments similar to those in the previous part, we can prove
convergence of ISA (under certain assumptions), provided that the projection
accuracies $(\varepsilon_k)$ obey conditions analogous to those for the case $\varphi \ge f^*$. Moreover,
recall that for $\varphi < f^*$, every refinement phase is finite, so that $f_k \ge \varphi$ is
guaranteed for all $k$; in particular, Step 6 is never executed since
$X \cap \{\, x \mid f(x) \le \varphi \,\} = \emptyset$.

Let us start with analogues of Lemmas 2 and 3.

Lemma 4. Let $\varphi < f^*$ and $0 < \lambda_k \le \beta < 2$. If $f_k \ge \varphi$ for some $k \in \mathbb{N}$, then

$d_{X^*}(x^{k+1})^2 \ \le\ d_{X^*}(x^k)^2 + \varepsilon_k^2 + 2\Big( \frac{\lambda_k (f_k - \varphi)}{\|h^k\|} + d_{X^*}(x^k) \Big)\varepsilon_k + L_k,$ (32)

where $L_k$ is defined in (12) of Theorem 3.

Proof. For $\varphi < f^*$, $0 < \lambda_k \le \beta < 2$, and $f_k \ge \varphi$, it holds that

$\lambda_k (f_k - \varphi) - 2 (f_k - f^*) \ \le\ \beta (f_k - \varphi) - 2 (f_k - f^*) \ =\ \beta (f^* - \varphi) + (2 - \beta)(f^* - f_k).$

The claim now follows immediately from Lemma 1. □

Lemma 5. Let $\varphi < f^*$, let $0 < \lambda_k \le \beta < 2$ and $f_k \ge f^* + \frac{\beta}{2 - \beta}(f^* - \varphi)$ for
all $k$, and let $L_k$ be given by (12). Then $(d_{X^*}(x^k))$ is monotonically decreasing
and converges to some $\xi \ge 0$, if $0 \le \varepsilon_k \le \tilde{\varepsilon}_k$ for all $k$, where $\tilde{\varepsilon}_k$ is defined in (13).

Proof. The condition $f_k \ge f^* + \frac{\beta}{2 - \beta}(f^* - \varphi)$ implies $L_k \le 0$ and hence ensures
that adaptive approximate projection can be used while still allowing for a
decrease in the distances of the subsequent iterates from $X^*$. The rest of the proof
is completely analogous to that of Lemma 3, considering (32) and (12) to derive
the upper bound $\tilde{\varepsilon}_k$ given by (13) on the projection accuracy. □

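We can now state the proof of our convergence results for the case $\varphi < f^*$.

Proof of Theorem 3. Let the assumptions of Theorem 3 hold. We start
with proving part (i): Let some $\delta > 0$ be given and suppose (contrary to the
desired result (i)) that $f_k > f^* + \frac{\beta}{2 - \beta}(f^* - \varphi) + \delta$ for all $k$ (possibly after
refinements). By Lemma 4,

$d_{X^*}(x^{k+1})^2 \ \le\ d_{X^*}(x^k)^2 + \varepsilon_k^2 + 2\Big( \frac{\lambda_k (f_k - \varphi)}{\|h^k\|} + d_{X^*}(x^k) \Big)\varepsilon_k + L_k.$

Since $0 < \underline{H} \le \|h^k\| \le \overline{H}$, $0 < \lambda_k \le \beta < 2$, and $\varphi \le f_k$, and due to our
assumption on $f_k$, i.e.,

$f^* - f_k + \frac{\beta}{2 - \beta}(f^* - \varphi) \ <\ -\delta$ for all $k$,

it follows that

$L_k \ \le\ -\frac{\lambda_k (2 - \beta)(f_k - \varphi)\,\delta}{\overline{H}^2} \ <\ 0.$

By Lemma 5, $d_{X^*}(x^k) \le d_{X^*}(x^0)$, and Corollary 1 again ensures existence of
some $F < \infty$ such that $f_k \le F$ for all $k$. Because also $\lambda_k (f_k - \varphi) \le \beta (F - \varphi)$
and $1/\|h^k\| \le 1/\underline{H}$, we hence obtain

$\frac{\lambda_k (2 - \beta)(f_k - \varphi)\,\delta}{\overline{H}^2} \ \le\ d_{X^*}(x^k)^2 - d_{X^*}(x^{k+1})^2 + \varepsilon_k^2 + 2\Big( \frac{\beta (F - \varphi)}{\underline{H}} + d_{X^*}(x^0) \Big)\varepsilon_k.$ (33)

Summation of these inequalities for $k = 0, 1, \dots, m$ yields

$\frac{(2 - \beta)\,\delta}{\overline{H}^2} \sum_{k=0}^{m} (f_k - \varphi)\,\lambda_k \ \le\ d_{X^*}(x^0)^2 - d_{X^*}(x^{m+1})^2 + \sum_{k=0}^{m} \varepsilon_k^2 + 2\Big( \frac{\beta (F - \varphi)}{\underline{H}} + d_{X^*}(x^0) \Big) \sum_{k=0}^{m} \varepsilon_k.$ (34)
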
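Moreover, our assumption on $f_k$ yields

$f_k - \varphi \ >\ f^* + \frac{\beta}{2 - \beta}(f^* - \varphi) + \delta - \varphi \ =\ \frac{2}{2 - \beta}(f^* - \varphi) + \delta.$

It follows from (34) that

$\frac{(2 - \beta)\,\delta}{\overline{H}^2} \Big( \frac{2}{2 - \beta}(f^* - \varphi) + \delta \Big) \sum_{k=0}^{m} \lambda_k \ \le\ d_{X^*}(x^0)^2 - d_{X^*}(x^{m+1})^2 + \sum_{k=0}^{m} \varepsilon_k^2 + 2\Big( \frac{\beta (F - \varphi)}{\underline{H}} + d_{X^*}(x^0) \Big) \sum_{k=0}^{m} \varepsilon_k.$

Now, by assumption, the left hand side tends to infinity as $m \to \infty$, whereas by
Lemma 5 and the choice of $0 \le \varepsilon_k \le \min\{ |\tilde{\varepsilon}_k|, \nu_k \}$ with a nonnegative summable
(and hence also square-summable) sequence $(\nu_k)$, the right hand side remains
finite. Thus, we have reached a contradiction, and part (i) is proven, i.e., there
does exist some $K$ such that $f_K \le f^* + \frac{\beta}{2 - \beta}(f^* - \varphi) + \delta$ (after possible
refinements of the projection accuracy used to recompute $x^K$).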

Let us now turn to part (ii): Again, let the main assumptions of Theorem 3
hold and let $\lambda_k \to 0$. Recall that for $\varphi < f^*$, we have $f_k \ge \varphi$ for all $k$ by
construction of ISA (refinement loops). We distinguish three cases:

If $f_k < f^*$ holds for all $k \ge k_0$ for some $k_0$, then $f_k \to f^*$ is obtained
immediately, just like in the proof of Theorem 1, by asymptotic feasibility.

On the other hand, if $f_k \ge f^*$ for all $k$ larger than some $k_0$, then repeated
application of part (i) yields a subsequence of $(f_k)$ which converges to $f^*$: For
any $\delta > 0$ we can find an index $K$ such that
$f^* \le f_K \le f^* + \frac{\beta}{2 - \beta}(f^* - \varphi) + \delta$.
Obviously, we get arbitrarily close to $f^*$ if we choose $\beta$ and $\delta$ small enough.
However, we have the restriction $\lambda_k \le \beta$. But since $\lambda_k \to 0$, we may "restart" our
argumentation if $\lambda_k$ is small enough and replace $\beta$ with a smaller one. With the
convergent subsequence thus constructed, we can then use the same technique
as in the proof of Theorem 2 (ii) to show that $(f_k)$ has no other accumulation
point but $f^*$, whence $f_k \to f^*$ follows.

Finally, when both cases $f_k < f^*$ and $f_k \ge f^*$ occur infinitely often, we
can proceed similarly to the proof of Theorem 1: The subsequence of function
values below $f^*$ converges to $f^*$, since $\varepsilon_k \to 0$. For the function values greater
than or equal to $f^*$, we assume that there is an accumulation point $f^* + \eta$ larger than
$f^*$, deduce that an appropriate sub-sum of the $\lambda_k$'s diverges and then sum up
equation (33) for the respective indices (belonging to $\{\, k \mid f_k \ge f^* \,\}$) to arrive at a
contradiction. Note that the iterate sequence $(x^k)$ is bounded, due to Corollary 1
(for iterations $k$ with $f_k \ge f^*$) and since the iterates with $\varphi \le f_k < f^*$ stay
within a bounded neighborhood of the bounded set $X^*$, since $\varepsilon_k$ tends to zero
and is summable. Therefore, as $f$ is continuous on a neighborhood of $X$ (which
contains all $x^k$ from some $k$ on), $(f_k)$ is bounded as well and therefore must
have at least one accumulation point. The only possibility left now is $f^*$, so we
conclude $f_k \to f^*$. □

Remark 6. With $f_k \to f^*$ and $\varepsilon_k \to 0$, we obviously have $d_{X^*}(x^k) \to 0$ in
the setting of Theorem 3. Furthermore, Remark 5 applies similarly: With more
conditions on $\lambda_k$ and more technical effort one can obtain convergence of the
sequence $(x^k)$ to some $\bar{x} \in X^*$.

4 Discussion 

In this section, we will discuss extensions of ISA. We will also illustrate how to
obtain bounds on the projection accuracies that are independent of the (generally
unknown) distance from the optimal set, and thus computable.

4.1 Extension to ε-subgradients

It is noteworthy that the above convergence analyses also work when replacing
the subgradients by ε-subgradients [6], i.e., replacing $\partial f(x^k)$ by

$\partial_{\rho_k} f(x^k) := \{\, h \in \mathbb{R}^n \mid f(x) - f(x^k) \ge h^T (x - x^k) - \rho_k \ \ \forall x \in \mathbb{R}^n \,\}.$ (35)

(To avoid confusion with the projection accuracy parameters $\varepsilon_k$, we use $\rho_k$.) For
instance, we immediately obtain the following result:

Corollary 2. Let ISA (Algorithm 1) choose $h^k \in \partial_{\rho_k} f(x^k)$ with $\rho_k \ge 0$ for all $k$.
Under the assumptions of Theorem 1, if $(\rho_k)$ is chosen summable
($\sum_{k=0}^{\infty} \rho_k < \infty$) and such that

(i) $\rho_k \le \mu\,\alpha_k$ for some $\mu > 0$, or

(ii) $\rho_k \le \mu\,\varepsilon_k$ for some $\mu > 0$,

then the sequence of ISA iterates $(x^k)$ converges to an optimal point.

Proof. The proof is analogous to that of Theorem 1; we will therefore only sketch
the necessary modifications: Choosing $h^k \in \partial_{\rho_k} f(x^k)$ (instead of $h^k \in \partial f(x^k)$)
adds the term $+2\alpha_k \rho_k$ to the right hand side of (16). If $\rho_k \le \mu\,\alpha_k$ for some
constant $\mu > 0$, the square-summability of $(\alpha_k)$ suffices: By upper bounding
$2\alpha_k \rho_k$, the constant term $+2\mu A$ is added to the definition of $R$ in (20). Similarly,
$\rho_k \le \mu\,\varepsilon_k$ does not impair convergence under the assumptions of Theorem 1,
because then the additional summand in (20) is

$2\sum_{k=0}^{m} \alpha_k \rho_k \ \le\ 2\mu \sum_{k=0}^{m} \alpha_k \varepsilon_k \ \le\ 2\mu \sum_{k=0}^{m} \Big( \alpha_k \sum_{\ell=k}^{\infty} \varepsilon_\ell \Big) \ \le\ 2\mu \sum_{k=0}^{m} \alpha_k^2 \ \le\ 2\mu A.$

The rest of the proof is almost identical, using $R$ modified as explained above and
some other minor changes where $\rho_k$-terms need to be considered, e.g., the term
$+\rho_{m_\ell}$ is introduced in (21), yielding an additional sum in (22), which remains
finite when passing to the limit because $(\rho_k)$ is summable. □
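
To make definition (35) concrete, the following Python sketch tests the defining inequality of a $\rho$-subgradient on a finite set of points. The helper name `is_rho_subgradient` and the one-dimensional example $f(x) = |x|$ are illustrative assumptions; a check on finitely many points can only refute membership in $\partial_\rho f(x)$, it cannot prove it.

```python
import numpy as np

def is_rho_subgradient(f, x, h, rho, test_points, tol=1e-12):
    """Check f(y) - f(x) >= h^T (y - x) - rho on a finite set of test points."""
    fx = f(x)
    return all(f(y) - fx >= np.dot(h, y - x) - rho - tol for y in test_points)

# Example: f(x) = |x| at x = 0.1. The exact subgradient there is h = 1,
# while h = -1 only satisfies the relaxed inequality once rho >= 0.2.
grid = np.linspace(-2.0, 2.0, 401)
exact_ok = is_rho_subgradient(abs, 0.1, 1.0, 0.0, grid)
relaxed_ok = is_rho_subgradient(abs, 0.1, -1.0, 0.2, grid)
too_tight = is_rho_subgradient(abs, 0.1, -1.0, 0.1, grid)
```

The example illustrates why $\rho_k \to 0$ is needed in the limit: a "wrong-signed" direction is admissible only at the price of a positive slack.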

Similar extensions are possible when using dynamic step sizes of the form (2).
The upper bounds (11) and (13) for the projection accuracies $(\varepsilon_k)$ will depend
on $(\rho_k)$ as well, which of course must be taken into account when extending the
proofs accordingly. Then, summability of $(\rho_k)$ (implying $\rho_k \to 0$) is enough to
guarantee convergence. In particular, one can again choose $\rho_k \le \mu\,\varepsilon_k$ for some
$\mu > 0$. We will not go into detail here, since the extensions are straightforward.
4.2 Variable target values

From a practical viewpoint, it may be desirable to have an algorithm, using
dynamic step sizes, that does not require the user to know a priori whether
an estimate $\varphi$ is larger or smaller than $f^*$. Moreover, relying on
a constant estimate may lead to overly small or large steps, which slows down
the convergence process (and, w.r.t. ISA (Algorithm 2), can also lead to many
projection accuracy refinement phases). Thus, a typical approach is to replace
the constant estimate $\varphi$ by variable target values $\varphi_k$. These target values are
then updated in the course of the algorithm to increasingly better estimates
of $f^*$, so that the dynamic step size (2) more and more resembles the "ideal"
Polyak step size (which would use $\varphi = f^*$). In principle, such extensions are also
possible for the ISA framework. We briefly describe the most important aspects
in the following.

First, note that Theorems 2 and 3 provide bounds on the projection accuracies
$(\varepsilon_k)$ needed for convergence; clearly, if it is unknown whether $\varphi_k \ge f^*$ or
$\varphi_k < f^*$, one must therefore choose $0 \le \varepsilon_k \le \min\{ \bar{\varepsilon}_k, |\tilde{\varepsilon}_k|, \nu_k \}$, with
$\bar{\varepsilon}_k$ and $\tilde{\varepsilon}_k$ given by (11) and (13), respectively.

Crucial for any variable target value method is the ability to somehow recognize
whether $\varphi_k \ge f^*$ or $\varphi_k < f^*$. If all iterates are feasible, this amounts to
recognizing whether $X \cap \{\, x \mid f(x) \le \varphi_k \,\} \ne \emptyset$ (or, as $x \in X$, simply
$f(x) \le \varphi_k$), implying $\varphi_k \ge f^*$, or $X \cap \{\, x \mid f(x) \le \varphi_k \,\} = \emptyset$, to infer
that $\varphi_k < f^*$, see, e.g., [14]. However, in the case of (possibly) infeasible iterates,
$f_k \le \varphi_k$ does not necessarily imply that $\varphi_k$ is too large. On the other hand,
viewing the ISA iterates $x^k$ as points of the "blown-up" feasible set
$B^k := \{\, x \mid x = y + z,\ y \in X,\ \|z\| \le \varepsilon_{k-1} \,\}$,
then $B^k \cap \{\, x \mid f(x) \le \varphi_k \,\} = \emptyset$ also implies that $\varphi_k < f^*$, since $X \subset B^k$.

In view of Theorem 3, keeping $\varphi_k$ constant once we recognized that $\varphi_k < f^*$
ensures convergence of $(f_k)$ to $f^*$ (in practice, it may nevertheless be desirable
to further improve the estimate $\varphi_k$ in order to avoid overly large steps in the
vicinity of the optimum). The associated case $B^k \cap \{\, x \mid f(x) \le \varphi_k \,\} = \emptyset$ can be
detected in practice, see [14, Section III.C] for details in the case of a feasible
method; these results are extensible to the ISA framework with appropriate
modifications.

The other case, $\varphi_k \ge f^*$, could be detected, e.g., with the help of an estimate
of the Lipschitz constant of $f$ (recall that every convex function is locally
Lipschitz and useful estimates should usually be available) and the distances to
$X$ implied by the projection accuracies.

In the literature, various schemes have been considered as update rules for
variable targets $(\varphi_k)$, see, e.g., [5,28,19,48,40,14,31,36]. In principle, many
such rules could be straightforwardly used in, or adapted to, a variable target
value ISA.

4.3 Computable bounds on $d_{X^*}(x^k)$

The results in Theorems 2 and 3 hinge on bounds $\bar{\varepsilon}_k$ and $\tilde{\varepsilon}_k$ on the projection
accuracy parameters $\varepsilon_k$, respectively. These bounds depend on unknown
information and therefore seem of little practical use for purposes such as, for
instance, an automated accuracy control in an implementation of the dynamic step
size ISA. While the quantity $f^*$ can sometimes be replaced by estimates directly,
it will generally be hard to obtain useful estimates for the distance of the current
iterate to the optimal set. However, such estimates are available for certain classes
of objective functions. We will sketch several examples in the following.

For instance, when $f$ is a strongly convex function, i.e., there exists some
constant $C > 0$ such that for all $x, y$ and $\mu \in [0, 1]$

$f(\mu x + (1 - \mu) y) \ \le\ \mu f(x) + (1 - \mu) f(y) - C \mu (1 - \mu) \|x - y\|^2,$

one can use the following upper bound on the distance to the optimal set [28]:

dx^ix) < min { ^i^, ^ ^mm ^ \\h\\ }. 

For functions $f$ such that $f(x) \ge C\|x\| - D$, with constants $C, D > 0$, one
can make use of $d_{X^*}(x) \le \|x\| + \frac{1}{C}(f^* + D)$, obtained by simply employing the
triangle inequality. Another related example class is induced by coercive
self-adjoint operators $F$, i.e., $f(x) := \langle Fx, x \rangle \ge C\|x\|^2$ with some constant $C > 0$
and a scalar product $\langle \cdot, \cdot \rangle$. The (usually) unknown $f^*$ appearing above may again
be treated using estimates.

Yet another important class is comprised of functions which have a set of
weak sharp minima [18] over $X$, i.e., there exists a constant $\mu > 0$ such that

$f(x) - f^* \ \ge\ \mu\, d_{X^*}(x) \quad \forall x \in X.$ (36)

Using $d_{X^*}(x) \le d_X(x) + d_{X^*}(\mathcal{P}_X(x))$ for $x \in \mathbb{R}^n$, we can then estimate the
distance of $x$ to $X^*$ via the weak sharp minima property of $f$. An important
subclass of such functions is composed of the polyhedral functions, i.e., $f$ has
the form $f(x) = \max\{\, a_i^T x + b_i \mid 1 \le i \le N \,\}$, where $a_i \ne 0$ for all $i$; the scalar $\mu$
is then given by $\mu = \min\{\, \|a_i\| \mid 1 \le i \le N \,\}$. Rephrasing (36) as

$d_{X^*}(x) \ \le\ \frac{f(x) - f^*}{\mu} \quad \forall x \in X,$

we see that for $\varphi \le f^*$ (e.g., dual lower bounds $\varphi$),

$d_{X^*}(x) \ \le\ \frac{f(x) - \varphi}{\mu} \quad \forall x \in X.$

Thus, when the bounds on the distance to the optimal set derived from using
the above inequalities become too conservative (i.e., too large, resulting in very
small $\varepsilon_k$-bounds), one could try to improve the above bounds by improving the
lower bound $\varphi$.
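
The bound above is cheap to evaluate. The following Python sketch computes it for a polyhedral function given by a matrix `A` (rows $a_i^T$) and a vector `b`; the function names are hypothetical, the value $\mu = \min_i \|a_i\|$ is taken from the text above, and the result is only meaningful for a feasible point $x \in X$ and a valid lower bound $\varphi \le f^*$, neither of which the code can verify.

```python
import numpy as np

def polyhedral_value(A, b, x):
    """f(x) = max_i (a_i^T x + b_i) for the rows a_i^T of A."""
    return float(np.max(A @ x + b))

def distance_bound(A, b, x, phi):
    """Upper bound (f(x) - phi) / mu on d_{X*}(x), with mu = min_i ||a_i||.

    Assumes x is feasible and phi <= f*; both are taken on trust here.
    """
    mu = float(np.min(np.linalg.norm(A, axis=1)))
    return (polyhedral_value(A, b, x) - phi) / mu

A = np.array([[3.0, 4.0], [0.0, -2.0]])   # row norms 5 and 2, so mu = 2
b = np.zeros(2)
x = np.array([1.0, 0.0])                  # f(x) = max(3, 0) = 3
bound = distance_bound(A, b, x, phi=1.0)  # (3 - 1) / 2
```

A better (larger) lower bound `phi` directly tightens the estimate, which in turn permits larger projection accuracy bounds.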

In practice one might have access to (problem-specific) estimates of $d_{X^*}(x)$;
in [14], it is claimed that "for most problems" prior experience or heuristic
considerations can be used to that end. For instance, if $X$ is compact, the diameter
of $X$ leads to the (conservative) estimate $d_{X^*}(x) \le \mathrm{diam}(X) + d_X(x)$.
5 Examples

In this section, we briefly discuss two examples in which we can design adaptive
approximate projections as considered in the ISA framework. In the first
example, we focus on the theoretical aspects of how our notion of adaptive
approximate projection could be used to handle a certain class of constraints
appearing in stochastic programs. The second application considers a
(deterministic) optimization problem for which we specialize ISA and present some
numerical experiments.



5.1 Convex expected value constraints

We consider expected value constraints [47, 33] of the following form

$g(x) := \mathbb{E}[f(x; \omega)] = \int_{\Omega} f(x; \omega)\, p(\omega)\, d\omega \ \le\ \eta,$ (37)

where $\mathbb{E}$ denotes the expected value, $\omega \in \Omega \subseteq \mathbb{R}^q$ is a vector of random variables
with density $p$, $x$ are deterministic variables in $\mathbb{R}^n$, $f : \mathbb{R}^n \times \mathbb{R}^q \to \mathbb{R}$, and $\eta \in \mathbb{R}$.
If $f$ is convex in $x$ for every $\omega \in \Omega$, (37) is a convex constraint. Expected value
constraints appear in stochastic programming as, for instance, the expectational
form of chance constraints, see, e.g., [11,7], or when modeling expected loss or
Value-at-Risk via integrated chance constraints, see, e.g., [21,27,22].

While generally $g(x)$ cannot be easily computed exactly, it can be approximated
using Monte Carlo methods, if samples of $\omega$ can be (cheaply) generated.
Here, taking $M$ independent samples $\omega^1, \dots, \omega^M$ yields the approximation

$g_M(x) := \frac{1}{M} \sum_{i=1}^{M} f(x; \omega^i)$ (38)

of $g(x)$. Moreover, we assume that we can compute a subgradient
$G(x; \omega) \in \partial_x f(x; \omega)$ for each value of $x$ and $\omega$. Thus, we have
$h := \mathbb{E}[G(x; \omega)] \in \partial g(x)$. We then use the approximation

$h_M(x) := \frac{1}{M} \sum_{i=1}^{M} G(x; \omega^i),$ (39)

which is a "noisy unbiased subgradient" of $g$ at $x$; see [8] for details.
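
The sample averages (38) and (39) can be sketched in a few lines of Python. The function name `mc_estimates`, the Gaussian sampling distribution, and the linear integrand used for the demonstration are illustrative assumptions; any integrand `f(x, w)` with a matching subgradient oracle `G(x, w)` could be plugged in.

```python
import numpy as np

def mc_estimates(f, G, x, samples):
    """Sample-average estimates g_M(x) and h_M(x) as in (38) and (39)."""
    g_M = float(np.mean([f(x, w) for w in samples]))
    h_M = np.mean([G(x, w) for w in samples], axis=0)
    return g_M, h_M

# Hypothetical demo: f(x; w) = w^T x, whose gradient in x is simply w.
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=(1000, 3))  # M = 1000 draws in R^3
x = np.array([1.0, 2.0, -1.0])
g_M, h_M = mc_estimates(lambda x, w: w @ x, lambda x, w: w, x, samples)
```

For this linear integrand the two estimates are tied together by $g_M(x) = h_M^T x$, a fact that is used in the linear example discussed in this subsection.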

Considering the Lagrangean $L(y, \lambda) = \tfrac{1}{2}\|x - y\|^2 + \lambda\,(g(y) - \eta)$ of the
projection problem for some point $x$ and the set of feasible points w.r.t. (37), the
optimality conditions for the projection obtained by differentiating $L$ are

$-x + y + \lambda h = 0$, for some $h \in \partial g(y)$, (40)
$g(y) - \eta = 0$. (41)

Then, the idea is to replace $g(y)$ and $h$ by the estimates $g_M(y)$ and $h_M(y)$,
respectively. An adaptive approximate projection is obtained by solving

$y = x - \lambda\, h_M(y), \qquad g_M(y) = \eta.$ (42)

For an appropriate sampling process, we can adaptively keep control on the
resulting projection error (with high probability).

We now demonstrate this approach on a simple example constraint in which
the above system can be solved easily and we obtain explicit projection error
bounds: Consider a linear function with random coefficients, i.e., $f(x; \omega) = \omega^T x$
and $q = n$. This particular type of constraint is closely related to integrated
chance constraints which are used, for instance, to model bounds on expected
losses of some sort; see, e.g., [21,27]. For this choice of $f$, our Monte Carlo
estimates are

$h_M(x) = \frac{1}{M} \sum_{i=1}^{M} \omega^i =: h_M \quad \text{and} \quad g_M(x) = h_M^T x.$ (43)


Note that if $\mathbb{E}[h_M(x)]$ is unknown, the feasibility operator construction in [23]
is not applicable. Moreover, assuming $h, h_M \ne 0$ corresponds to imposing a
lower bound on the subgradient norm, like in the convergence theorems for ISA.
Observing that $h_M$ is independent of $x$ (so in particular, $h_M(y) = h_M$ as well),
we can solve (42) to obtain the solution

$\mathcal{P}^M(x) := x - \frac{h_M^T x - \eta}{\|h_M\|^2}\, h_M$ (44)

to the approximated projection problem. The exact projection is given by

$\mathcal{P}^\infty(x) := x - \frac{h^T x - \eta}{\|h\|^2}\, h,$ (45)

and (as the notation suggests) we have $\mathcal{P}^\infty(x) = \lim_{M \to \infty} \mathcal{P}^M(x)$ almost
surely, since $\mathrm{Prob}(\lim_{M \to \infty} h_M = h) = 1$ by the (strong) law of large numbers.
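
For the linear example, the approximate projection follows directly from solving (42) with $g_M(y) = h_M^T y$: substituting $y = x - \lambda h_M$ gives $\lambda = (h_M^T x - \eta)/\|h_M\|^2$. The Python sketch below implements this closed form; the function name and the demonstration data are assumptions made for illustration, and since (42) enforces equality, the sketch always moves $x$ onto the hyperplane $\{ y : h_M^T y = \eta \}$.

```python
import numpy as np

def approx_projection(x, h_M, eta):
    """Closed-form solution of (42) for g_M(y) = h_M^T y (requires h_M != 0)."""
    x = np.asarray(x, dtype=float)
    h_M = np.asarray(h_M, dtype=float)
    lam = (h_M @ x - eta) / (h_M @ h_M)   # multiplier lambda from g_M(y) = eta
    return x - lam * h_M

# Hypothetical demo: the estimate h_M is the mean of M sampled coefficient vectors.
rng = np.random.default_rng(0)
omega = rng.normal(loc=[1.0, 2.0], scale=0.3, size=(500, 2))
h_M = omega.mean(axis=0)
y = approx_projection([3.0, 2.0], h_M, eta=1.0)
```

Feeding in the true mean vector $h$ instead of `h_M` gives the exact projection, so the projection error is controlled entirely by the quality of the estimate `h_M`.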

For sufficiently large $M$, we can use explicit $(1 - \alpha)$-confidence intervals for
the expected value $h = \mathbb{E}[h_M]$ via the central limit theorem, and eventually
obtain

$\mathrm{Prob}\big( \|\mathcal{P}^M(x) - \mathcal{P}^\infty(x)\| \le \varepsilon_M \big) = 1 - \alpha,$ (46)

where



hJiX - 7] 



M 



hljx -r) + c-qjjs 



(/iM + C • Qm] 



with c = ~sign{hl^qM) and 

9(l-a/2) 



M 



13((a;*)l-(Ml)^.. 



M 



\ i=l 
where $q_{(1-\alpha/2)}$ denotes the $(1 - \tfrac{\alpha}{2})$-quantile of the standard normal distribution.
Thus, for any given $\alpha \in (0, 1)$ and for sufficiently large $M$, $\mathcal{P}^M$ defines an
adaptive approximate projection operator as specified in the ISA framework,
with probability $1 - \alpha$.

It is noteworthy that the projection accuracy directly depends on $M$, and in
the linear example above we could iteratively refine the estimate $h_M$ easily by
incorporating newly drawn independent samples.
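
Such a refinement does not require storing old samples, since a sample mean can be updated from its previous value and the sample count alone. The following Python sketch shows one way to do this; the function name `refine_estimate` and the synthetic data are hypothetical.

```python
import numpy as np

def refine_estimate(h_M, M, new_samples):
    """Update the sample mean h_M (built from M samples) with further samples.

    Returns the refined mean and the new sample count.
    """
    new_samples = np.atleast_2d(np.asarray(new_samples, dtype=float))
    K = new_samples.shape[0]
    return (M * h_M + new_samples.sum(axis=0)) / (M + K), M + K

# Hypothetical demo: start from 100 samples, then add 50 fresh ones.
rng = np.random.default_rng(1)
data = rng.normal(size=(150, 4))
h, M = data[:100].mean(axis=0), 100
h, M = refine_estimate(h, M, data[100:])
```

The refined value coincides with the mean over all samples drawn so far, so each refinement phase only costs as much as the newly drawn batch.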



5.2 Compressed sensing 

Compressed Sensing (CS) is a recent and very active research field dealing, 
loosely speaking, with the recovery of signals from incomplete measurements. 
We refer the interested reader to [17, 9, 15] for more information, surveys, and 
key literature. A core problem of CS is finding the sparsest solution to an
underdetermined linear system, i.e.,

$$\min \|x\|_0 \quad \text{s.t.} \quad Ax = b, \qquad (A \in \mathbb{R}^{m \times n},\ \operatorname{rank}(A) = m,\ m < n), \tag{47}$$

where $\|x\|_0$ denotes the $\ell_0$ quasi-norm or support size of the vector $x$, i.e., the
number of its nonzero entries. This problem is known to be $\mathcal{NP}$-hard. Hence, a
common approach is to consider the convex relaxation known as $\ell_1$-minimization
or Basis Pursuit [12]:

$$\min \|x\|_1 \quad \text{s.t.} \quad Ax = b. \tag{48}$$

It was shown that under certain conditions, the solutions of (48) and (47) coincide;
see, e.g., [10, 17]. This motivated a large amount of research on the efficient
solution of (48), especially in large-scale settings. In this section, we briefly outline
a specialization of the ISA to the $\ell_1$-minimization problem (48) and present
some numerical experiments indicating that the algorithm is an interesting candidate
in the context of Compressed Sensing.

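Subgradients. The subdifferential of the $\ell_1$-norm at a point $x$ is given by

$$\partial \|x\|_1 = \Big\{\, h \in [-1,1]^n \;:\; h_i = \frac{x_i}{|x_i|} \ \ \forall\, i \in \{1,\dots,n\} \text{ with } x_i \neq 0 \,\Big\}. \tag{49}$$

We may therefore simply use the signs of the iterates as subgradients, i.e.,

$$\partial \|x^k\|_1 \ni h^k := \operatorname{sign}(x^k) = \begin{cases} 1, & (x^k)_i > 0, \\ 0, & (x^k)_i = 0, \\ -1, & (x^k)_i < 0. \end{cases} \tag{50}$$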
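
As a quick illustration of (50), the following Python sketch (with hypothetical helper names and toy vectors chosen here, not taken from the paper) computes the sign vector and checks the subgradient inequality $\|y\|_1 \ge \|x\|_1 + h^\top (y - x)$ at one test point.

```python
def sign(t):
    """Return 1, 0, or -1 according to the sign of t."""
    return (t > 0) - (t < 0)


def l1_norm(x):
    return sum(abs(t) for t in x)


def l1_subgradient(x):
    """Componentwise sign vector, a subgradient of the l1-norm at x."""
    return [sign(t) for t in x]


x = [1.5, 0.0, -2.0]
h = l1_subgradient(x)
y = [-1.0, 3.0, 0.5]

# Subgradient inequality: ||y||_1 - (||x||_1 + h^T (y - x)) must be nonnegative.
gap = l1_norm(y) - (l1_norm(x) + sum(hi * (yi - xi) for hi, yi, xi in zip(h, y, x)))
```

Zero entries of the iterate receive the subgradient entry 0, which is an admissible choice from the interval $[-1,1]$ allowed by (49).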



As long as $b \neq 0$, the upper and lower bounds on the norms of the subgradients
satisfy $\underline{H} \ge 1$ and $\overline{H} \le n$.

Adaptive approximate projection. For linear equality constraints as in (48),
the Euclidean projection of a point $z \in \mathbb{R}^n$ onto the affine feasible set
$X := \{\, x \mid Ax = b \,\}$ can be explicitly calculated as

$$\mathcal{P}_X(z) = \big(I - A^\top (AA^\top)^{-1} A\big) z + A^\top (AA^\top)^{-1} b, \tag{51}$$

where $I$ denotes the $(n \times n)$ identity matrix. However, for numerical stability, we
wish to avoid the explicit calculation of the projection matrix because it involves
determining the inverse of the matrix product $AA^\top$. Instead of applying (51) in
each iteration, we can use the following adaptive procedure:

$$z^k := x^k - \alpha_k h^k \quad \text{(unprojected next iterate)}, \tag{52}$$
$$\text{find an approximate solution } q^k \text{ of } AA^\top q = Az^k - b, \tag{53}$$
$$\mathcal{P}_X^{\varepsilon_k}(z^k) := z^k - A^\top q^k. \tag{54}$$



Note that the matrix $AA^\top$ is symmetric and positive definite for $A$ with full
(row-)rank $m$. Hence, the linear system in (53) can be solved by an iterative
method, e.g., the method of Conjugate Gradients (CG) [24].
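
To see why solving (53) suffices, (51) can be rearranged as in the following short derivation. It is added as a reading aid and uses only quantities already introduced, except for $q^*$, which is a symbol chosen here to denote the exact solution of (53) for a given point $z$.

```latex
\begin{align*}
\mathcal{P}_X(z) &= z - A^\top (AA^\top)^{-1} A z + A^\top (AA^\top)^{-1} b \\
                 &= z - A^\top (AA^\top)^{-1} (Az - b) \\
                 &= z - A^\top q^*, \qquad \text{where } AA^\top q^* = Az - b.
\end{align*}
% Feasibility check: A \mathcal{P}_X(z) = Az - AA^\top (AA^\top)^{-1} (Az - b) = b.
```

Replacing $q^*$ by an approximate solution $q$ perturbs the projection by $A^\top (q - q^*)$, whose Euclidean norm is at most $\|AA^\top q - (Az - b)\|_2 / \sigma_{\min}(A)$, with $\sigma_{\min}(A)$ the smallest singular value of $A$.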

For a given $\varepsilon_k$, stopping the CG procedure in (53) as soon as the iteratively
updated approximate solution $q^k$ satisfies

$$\|AA^\top q^k - (A(x^k - \alpha_k h^k) - b)\|_2 \le \sigma_{\min}(A)\,\varepsilon_k, \tag{55}$$

where $\sigma_{\min}(A) > 0$ is the smallest singular value of $A$, ensures that (52)-(54)
form an adaptive approximate projection operator of the type (7). Note that a
truncated CG procedure (with any fixed number of iterations) can also be shown
to define a "feasibility operator" of the type considered in [23].
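
The procedure (52)-(54) with the stopping rule (55) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Matlab implementation used in the experiments: the helper names (`matvec`, `rmatvec`, `approx_projection`) and the toy data are hypothetical, matrices are plain lists of rows, and the caller is assumed to pass `tol` equal to $\sigma_{\min}(A)\,\varepsilon_k$.

```python
import math


def matvec(A, x):
    """Compute A x for a matrix given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def approx_projection(A, b, z, tol, max_iter=1000):
    """Approximately project z onto {x : A x = b}.

    Runs CG on (A A^T) q = A z - b and stops once the residual norm is at
    most `tol` (or after `max_iter` steps), then returns z - A^T q.
    """
    rhs = [s - t for s, t in zip(matvec(A, z), b)]
    q = [0.0] * len(b)
    r = list(rhs)  # residual rhs - (A A^T) q for the start q = 0
    p = list(r)
    rr = sum(t * t for t in r)
    for _ in range(max_iter):
        if math.sqrt(rr) <= tol:
            break
        Atp = rmatvec(A, p)
        Mp = matvec(A, Atp)  # (A A^T) p without forming A A^T
        step = rr / sum(t * t for t in Atp)  # p^T (A A^T) p = ||A^T p||^2
        q = [qi + step * pi for qi, pi in zip(q, p)]
        r = [ri - step * mi for ri, mi in zip(r, Mp)]
        rr_new = sum(t * t for t in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return [zi - ci for zi, ci in zip(z, rmatvec(A, q))]


# Toy example; here the eigenvalues of A A^T are 1 and 3, so sigma_min(A) = 1.
A = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
b = [1.0, 2.0]
z = [3.0, -1.0, 4.0]
x_new = approx_projection(A, b, z, tol=1e-10)
```

Setting `max_iter` to a small fixed number gives the truncated variant mentioned above; the matrix $AA^\top$ is never formed explicitly, only products with $A$ and $A^\top$ are used.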

Furthermore, to obtain computable upper bounds on $(\varepsilon_k)$, we can use the
results about weak sharp minima discussed in the previous section: The $\ell_1$-norm
can be rewritten as a polyhedral function. With $\varphi \le f^*$ (which is easily available,
e.g., $\varphi = 0$), we can thus derive

...(.^) < 2Mi^ + MlU^. 

O-min(^) Vn 

In total, this yields bounds that can be easily computed from the original data 
only. Theorems 1, 2, or 3 then provide explicit convergence statements. 

Numerical experiments. It is well-known that (48) can be solved as a linear
program (LP), e.g., employing the standard variable split $x = x^+ - x^-$:

$$\min\ \mathbf{1}^\top x^+ + \mathbf{1}^\top x^- \quad \text{s.t.} \quad Ax^+ - Ax^- = b,\ x^+ \ge 0,\ x^- \ge 0. \tag{56}$$

Another common approach to (48) is to solve a sequence of regularized problems
of the form

$$\min\ \tfrac{1}{2}\|Ax - b\|_2^2 + \tau \|x\|_1, \tag{57}$$

with decreasing $\tau$. As $\tau \to 0$, the solution sequence $x(\tau)$ of (57) converges to a
solution of (48). The homotopy method (see, e.g., [43,39]) traces this solution
path for decreasing $\tau$ and has the desirable property of requiring only $k$ steps to
reach the optimal solution $x^*$ to (48), if $x^*$ has only $k$ nonzero entries and $k$ is
sufficiently small.

We performed experiments to compare our ISA Algorithm 2, applied to (48)
(using adaptive approximate or exact projections), with the commercial LP
solver Cplex 12.5 (dual simplex method applied to (56)) and the homotopy
implementation (version 1.0) available at http://users.ece.gatech.edu/~sasif/homotopy/.
In our ISA implementation we employ at most 5 CG steps to approximate
the projection; although this deviates from the theory, it turned out to suffice.
Moreover, the subgradients are stabilized as in [37], and the parameter $\lambda_k$ is
halved after 5 consecutive iterations without relevant improvement of the objective
($\lambda_0 = 0.85$); the method terminates when the step sizes become too small
or if a stagnation of the algorithmic process is detected. By stagnation, we mean
that either the objective improvement stalls over a span of 500 iterations, or
the approximate support S = {i : \x^ \ > max{10~^,s}} does not change over 
10 successive updates, which are performed every $m/100$ iterations; here $s$ is
chosen such that the entries $x_j$ with $|x_j| > s$ account for at least 99.99% of
$\|x^k\|_1$. Finally, as a postprocessing step after termination, we try to improve the
solution by solving the system restricted to columns indexed by $S$, similar to the
"debiasing" step described in [51, Section II. I].

Note that in contrast to Cplex, the homotopy method and ISA are implemented
in Matlab (version R2012a/7.14). Moreover, by default, Cplex ensures
feasibility in the sense that the computed solution $x$ obeys $\|Ax - b\|_\infty \le 10^{-6}$;
from the respective convergence results, both the homotopy method and ISA
will reach this level of feasibility after finitely many iterations. As a safeguard,
we added a high-accuracy projection after regular termination. However,
this step was never required for the homotopy method, and only on a single
instance for ISA (the running time it induced, as well as the time for the
postprocessing step, is included in the times reported below).
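
The approximate support used in the stagnation test described above can be computed as in the following Python sketch. It rests on assumptions made here for illustration: the function name is hypothetical, $s$ is taken to be the largest threshold for which the entries exceeding it still carry the required share of the $\ell_1$-norm, and the default value of `floor` (the absolute lower limit inside the maximum) is only a placeholder.

```python
def approximate_support(x, fraction=0.9999, floor=1e-6):
    """Indices i with |x_i| > max(floor, s).

    s is the largest magnitude threshold such that the entries with
    |x_j| > s account for at least `fraction` of the l1-norm of x.
    The default `floor` is a placeholder assumption.
    """
    total = sum(abs(t) for t in x)
    s = 0.0
    # Try the distinct magnitudes as thresholds, largest first.
    for cand in sorted({abs(t) for t in x}, reverse=True):
        if sum(abs(t) for t in x if abs(t) > cand) >= fraction * total:
            s = cand
            break
    cutoff = max(floor, s)
    return {i for i, t in enumerate(x) if abs(t) > cutoff}


x = [1.0, 0.0, -2.0, 1e-7, 0.5]
support = approximate_support(x)
```

A stagnation check would then compare the sets returned at successive updates and stop once they have been identical sufficiently often.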

The first test uses a $1024 \times 4096$ Gaussian matrix, the second one a partial
discrete cosine transform (DCT) matrix consisting of 512 randomly drawn rows
of the $2048 \times 2048$ DCT matrix; all columns are normalized to unit Euclidean
length. For both matrices, we constructed ten vectors $x^*$ with sparsities
$\|x^*\|_0 = i \cdot m/10$, $i \in \{1, \dots, 10\}$ (rounded down to the nearest integer). The nonzero
entries are $\pm 1$ and each $x^*$ is the known unique solution to the instance given by
the respective matrix $A$ and right-hand side vector $b := Ax^*$, where uniqueness
was achieved by ensuring the "strong source condition" (see, e.g., [20]) by means
of the methodology proposed in [32].
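
A scaled-down version of the Gaussian test setup can be sketched as follows. The function name, the seed handling, and the small dimensions are assumptions for illustration, and the sketch does not enforce the strong source condition, so uniqueness of the solution is not guaranteed by this construction.

```python
import math
import random


def make_instance(m, n, k, seed=0):
    """Gaussian matrix with unit-norm columns, k-sparse +-1 vector, and b = A x."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    # Normalize every column to unit Euclidean length.
    for j in range(n):
        col_norm = math.sqrt(sum(A[i][j] ** 2 for i in range(m)))
        for i in range(m):
            A[i][j] /= col_norm
    # Random support of size k with entries +1 or -1.
    support = rng.sample(range(n), k)
    x_star = [0.0] * n
    for j in support:
        x_star[j] = rng.choice((-1.0, 1.0))
    b = [sum(A[i][j] * x_star[j] for j in support) for i in range(m)]
    return A, b, x_star


m, n = 8, 32
k = (3 * m) // 10  # sparsity i * m / 10 for i = 3, rounded down
A, b, x_star = make_instance(m, n, k, seed=1)
```

The partial DCT instances would be built analogously, with the random matrix replaced by randomly selected rows of the DCT matrix.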

Figure 3 shows the running times (in seconds) and the $\ell_2$-norm distances
to the respective known optimal solution. As explained above, all solutions are
feasible to within an $\ell_\infty$-tolerance of $10^{-6}$. The experiments show that using
adaptive approximate projections instead of the exact ones in ISA saves a considerable
amount of time, as was to be expected. The achieved final accuracy
is almost always (nearly) the same. For the varying sparsity levels of the solution,
we see that all solvers struggle when the number of nonzero entries in the
optimum exceeds about $m/2$: Cplex and the homotopy method still produce
mostly accurate solutions, but at the cost of a significant increase in the required
solution times (note the logarithmic scales on the vertical axes). ISA, on the other
hand, has a somewhat more stable runtime behavior but loses accuracy when
the solution is dense.

Fig. 3. Numerical experiments for Gaussian matrix ((a) and (b)) and partial DCT
matrix ((c) and (d)), each with normalized columns, for varying solution sparsities.

Since in Compressed Sensing, the solutions encountered are typically very
sparse, the interesting cases are those with sparsity (much) smaller than $m/2$.
Clearly, for such sparse optimal solutions, ISA (with adaptive approximate
projections) is superior to Cplex and the homotopy implementation both in terms
of accuracy and speed. Thus, these examples show the potential of ISA as a
successful algorithm for CS sparse recovery.

6 Concluding remarks

Several aspects remain subject to future research. For instance, it would be in- 
teresting to investigate whether our framework extends to (infinite-dimensional) 
Hilbert space settings, incremental subgradient schemes, bundle methods (see, 
e.g., [25,29]), or Nesterov's algorithm [42]. It is also of interest to consider how 
the ISA framework could be combined with error-admitting settings such as those 
in [52,41], i.e., for random or deterministic (non-vanishing) noise and erroneous 
function or subgradient evaluations. Some of the recent results in [41], which all 
require feasible iterates, seem conceptually close to our convergence analyses, so 
we presume a blend of the two approaches to be rather fruitful. It would also
be of interest to investigate convergence behavior with other general notions of
"adaptive approximate projections", e.g., solving the projection problem with an
approximation algorithm with additive or multiplicative performance guarantee.

From a practical viewpoint, it will be interesting to see how ISA, or possi- 
bly a variable target value variant as described in Section 4.2, compares with 
other solvers in terms of solution accuracy and runtime. For the $\ell_1$-minimization
problem (48), we have seen in Section 5.2 that ISA promises to be an interesting
candidate; an extensive computational comparison of various state-of-the-art
$\ell_1$-solvers, including (a more refined version of) our ISA implementation, can be
found in [38]. An extensive test for convex expected value constraints, while
beyond the scope of this paper, would be an interesting further line of work. 

Acknowledgments. We thank the anonymous referees for their numerous helpful
comments, which greatly helped to improve this paper.